Implement fill_via_chunks without using zerocopy or unsafe #1575
base: master
Conversation
Force-pushed f559d51 to 8fe7fcc
I looked at the generated x86_64 assembly of my original naive implementation: it didn't get compiled into a single `memcpy`. So I've updated the PR with another implementation. This alternative implementation doesn't result in a call to `memcpy` at all. I am on a laptop that isn't reliable enough performance-wise to run benchmarks on. Notably, in both of these implementations there are no calls to panic functions in the generated assembly, unlike the present (before the PR) code, which is a benefit to certain applications. Regardless, I think it's just a matter of code golf to find a reasonable implementation, if this one isn't it.
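For reference, a naive per-byte implementation in this spirit might look like the following. This is a hypothetical reconstruction, not the PR's actual code, and it is the kind of loop the optimizer may or may not collapse into a `memcpy`:

```rust
// Hypothetical sketch of a naive byte-by-byte fill; the real PR code may differ.
fn fill_via_chunks_naive(src: &[u32], dest: &mut [u8]) {
    // Flatten src into little-endian bytes and copy one byte per iteration.
    for (dest, src) in dest
        .iter_mut()
        .zip(src.iter().flat_map(|x| x.to_le_bytes()))
    {
        *dest = src;
    }
}
```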
Force-pushed 8fe7fcc to 4743a47
Here's a method which uses a single `memcpy` for most of the data but a manual loop for the remainder:

```rust
for (chunk, src) in dest.chunks_exact_mut(SIZE).zip(&mut *src) {
    chunk.copy_from_slice(&src.to_le_bytes());
}
// using src.iter().by_ref() doesn't produce a memcpy
let n = (dest.len() / SIZE) * SIZE;
for (dest, src) in dest
    .chunks_exact_mut(SIZE)
    .into_remainder()
    .iter_mut()
    .zip(src.iter().flat_map(|x| x.to_le_bytes()).skip(n))
{
    *dest = src;
}
```
The above won't compile as written; here's a complete version:

```rust
use core::cmp::min;

pub fn fill_via_chunks<T: Observable>(src: &mut [T], dest: &mut [u8]) -> (usize, usize) {
    let size = core::mem::size_of::<T>();
    let byte_len = min(core::mem::size_of_val(src), dest.len());
    let num_chunks = (byte_len + size - 1) / size;
    let n = (dest.len() / size) * size;
    dest.chunks_exact_mut(size)
        .zip(&mut *src)
        .for_each(|(chunk, src)| {
            chunk
                .iter_mut()
                .zip(src.to_le_bytes())
                .for_each(|(dest, src)| {
                    *dest = src;
                })
        });
    dest.chunks_exact_mut(size)
        .into_remainder()
        .iter_mut()
        .zip(src.iter().flat_map(|x| x.to_le_bytes()).skip(n))
        .for_each(|(dest, src)| {
            *dest = src;
        });
    (num_chunks, byte_len)
}
```

Here's the godbolt result of the above code, in case folks want to see for themselves or poke around further: https://rust.godbolt.org/z/GfWxb9nzG
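Note that these snippets assume `rand_core`'s internal `Observable` trait. A minimal stand-in, sufficient to experiment with the snippets on godbolt, might look like this; the trait's real definition in `rand_core` differs, so treat this purely as an assumption:

```rust
// Minimal stand-in for the internal trait assumed above; not rand_core's actual definition.
pub trait Observable: Copy {
    // Byte representation usable both as a slice and as an iterator of bytes.
    type Bytes: AsRef<[u8]> + IntoIterator<Item = u8>;
    fn to_le_bytes(self) -> Self::Bytes;
}

impl Observable for u32 {
    type Bytes = [u8; 4];
    fn to_le_bytes(self) -> Self::Bytes {
        // Delegates to the inherent method on u32.
        u32::to_le_bytes(self)
    }
}
```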
Right, thank you, I didn't abstract enough. Also, I found this tail method produces better codegen:

```rust
let chunks = dest.chunks_exact_mut(size);
if let Some(src) = src.get(chunks.len()) {
    chunks
        .into_remainder()
        .iter_mut()
        .zip(src.to_le_bytes())
        .for_each(|(dest, src)| {
            *dest = src;
        });
}
```

https://rust.godbolt.org/z/s95dWrafe

But I don't know much about zerocopy, so I can't speak to any performance change from dropping it.
I think @Bluefinger is on the right track with:

```rust
fn fill_via_chunks<T: Observable>(src: &[T], dest: &mut [u8]) -> (usize, usize) {
    let size = core::mem::size_of::<T>();
    let mut dest_it = dest.chunks_exact_mut(size);
    let it = dest_it.by_ref().zip(src);
    let num_chunks = it.len();
    let byte_len = num_chunks * size;
    for (dest_chunk, src_chunk) in it {
        dest_chunk.copy_from_slice(src_chunk.to_le_bytes().as_ref());
    }
    let dest_tail = dest_it.into_remainder();
    let n = dest_tail.len();
    if n > 0 && src.len() > num_chunks {
        dest_tail.copy_from_slice(&src[num_chunks].to_le_bytes().as_ref()[..n]);
        (num_chunks + 1, byte_len + n)
    } else {
        (num_chunks, byte_len)
    }
}
```

I honestly think this code is more readable than the existing code using zerocopy, mostly because it uses iterator methods to do most of the work. This gets compiled fairly well (Godbolt link) as one `memcpy`.
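For illustration, here is a hypothetical sanity check of this version's return-value semantics, assuming an `Observable` impl for `u32` along the lines of the stand-in sketched earlier:

```rust
// Hypothetical test of the (num_chunks, byte_len) return values; names are illustrative.
#[test]
fn fill_via_chunks_partial_word() {
    let src = [0x0403_0201u32, 0x0807_0605];
    let mut dest = [0u8; 6];
    let (consumed, filled) = fill_via_chunks(&src, &mut dest);
    // One full 4-byte chunk is copied, then 2 tail bytes from the second word,
    // so two source words are consumed and all 6 dest bytes are filled.
    assert_eq!((consumed, filled), (2, 6));
    assert_eq!(dest, [1, 2, 3, 4, 5, 6]);
}
```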
This is the only use of `zerocopy` in `rand_core`, so we can remove it from `rand_core/Cargo.toml`.
Force-pushed f867f03 to 4e5b6b2
Thanks everybody! I updated the PR with a version that uses your ideas. I am still frustrated that it compiles to two `memcpy`s instead of one.
Also, in this new version, the compiler is still able to statically verify that there are no panics (no calls into panic infrastructure in the generated assembly).
Reduce the use of `zerocopy` as a step towards eliminating the dependency.
Force-pushed 4e5b6b2 to 7a14dfe
Looks good. I have a few style nits, but they are equivalent to your code, so feel free to ignore them if you prefer how it's currently written.
```rust
let zipped = dest.by_ref().zip(src.by_ref());
let num_chunks = zipped.len();
zipped.for_each(|(dest, src)| dest.copy_from_slice(src.to_le_bytes().as_ref()));
```
Nit: It seems like this crate generally prefers `for` loops to `for_each`; is there a reason to use `for_each` here? Is it because of variable shadowing?
The compiler doesn't optimise the same for `for` loops vs `for_each`, and for ensuring certain optimisations take place, you need to use `for_each`. In like 90% of cases, `for` loops are just fine, but when targeting particular optimisation cases, `for_each` is usually what you need to reach for, particularly when you are trying to get autovectorisation to kick in.
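To make the comparison concrete, the two shapes under discussion look roughly like this; this is a sketch of the pattern, not the PR's exact code:

```rust
// Both copy each u32 into dest as little-endian bytes; only the loop form differs.
fn copy_with_for(src: &[u32], dest: &mut [u8]) {
    for (chunk, word) in dest.chunks_exact_mut(4).zip(src) {
        chunk.copy_from_slice(&word.to_le_bytes());
    }
}

fn copy_with_for_each(src: &[u32], dest: &mut [u8]) {
    dest.chunks_exact_mut(4)
        .zip(src)
        .for_each(|(chunk, word)| chunk.copy_from_slice(&word.to_le_bytes()));
}
```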
For the record, I have nothing against usage of `for_each` in this crate.
Just as a reference: https://rust.godbolt.org/z/3qfned8vn

This shows the compiled output diff between the `for` loop and `for_each`. The `for_each` is able to optimise to the two `memcpy` ops, whereas the `for` optimises to a single `memcpy` case with a separate path that does not have `memcpy`.
Huh. Do we want the tail case to use `memcpy`? In the worst case it's only 7 bytes.
```rust
zipped.for_each(|(dest, src)| dest.copy_from_slice(src.to_le_bytes().as_ref()));

let byte_len = num_chunks * size;
if let (dest_tail @ [_, ..], Some(src)) = (dest.into_remainder(), src.next()) {
```
Nit: I find the `if let` pattern here a little confusing; could we just do:

```rust
if let Some(src) = src.next() {
    // We have consumed all full chunks of dest, but not src.
    let dest = dest.into_remainder();
    let n = dest.len();
    if n > 0 {
        dest.copy_from_slice(&src.to_le_bytes().as_ref()[..n]);
        return (num_chunks + 1, byte_len + n);
    }
}
(num_chunks, byte_len)
```
The usage of a slice match pattern to check for a non-empty tail is not so obvious, no. This alternative compiles to identical assembly, so I'll take it by preference.
```rust
pub fn fill_via_u32_chunks(src: &mut [u32], dest: &mut [u8]) -> (usize, usize) {
    // TODO(SemVer): src: `&[u32]` as we don't mutate it.
    fill_via_chunks(src, dest)
}
```
Does anyone know why this method is public at all? It is used in the impl of `RngCore` for `BlockRng<R: BlockRngCore>`, which is crate-local. I doubt RNGs would want to use it directly. It looks like this `pub fn` pre-dates `BlockRng`.

I suggest that we deprecate these two methods.
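If we go that route, deprecation could look roughly like this; the version number and note text are placeholders, not decided:

```rust
// Sketch only: "0.9.0" and the note text are placeholders.
#[deprecated(since = "0.9.0", note = "not intended for direct use; see `BlockRng`")]
pub fn fill_via_u32_chunks(src: &mut [u32], dest: &mut [u8]) -> (usize, usize) {
    fill_via_chunks(src, dest)
}
```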
I was wondering that myself.
To do:

- Alternative code as suggested above
- Deprecate methods as noted above
- Add a note to the changelog, under the `[Unreleased]` heading
- (Optional) Do not use `memcpy` for the tail case

After that I'm happy with this: the code is clear enough, at most two usages of `memcpy` seems fine, benchmark results are fine.
CHANGELOG.md entry

Summary: Reduce use of `zerocopy`.

Motivation: Make progress towards reducing the set of dependencies.

Details