-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added read_u64 optimizations to big endian. #27
Conversation
src/number.rs
Outdated
if is_8digits(v) { | ||
*x = x | ||
.wrapping_mul(1_0000_0000) | ||
.wrapping_add(parse_8digits_le(v)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I'm just trying to wrap my head around to what exactly gets passed to parse_8digits_le()
on big-endian arch and why does it actually work if it does...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it does work indeed, just isn't immediately obvious. Maybe _le()
suffix should be removed then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it should, thats my mistake. Ill fix it and recommit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I'm just trying to wrap my head around to what exactly gets passed to
parse_8digits_le()
on big-endian arch and why does it actually work if it does...
This works because of the following:
- The code all assumes the bytes are read in little-endian order. Say I have the string "12345678", in little-endian, I have now read this as
0x3837363534333231
, or essentially, the string is reversed. - The rest of the code is all pure numerical operations.
This is all the relevant code, besides try_readu64
. There is only mathematical operations, that is, no matter what endian the architecture is, you get the same result as long as the integer passed to each function is the same. is_8digits(0x3837363534333231)
will always produce the same result, as will parse_8digits(0x3837363534333231)
, since there is no endian-dependent code paths for either.
What matters, however, is that the bytes are read (or written) in little-endian order in all cases. So merely changing read_u64
should fix the issue in all cases.
#[inline]
pub fn is_8digits(v: u64) -> bool {
let a = v.wrapping_add(0x4646_4646_4646_4646);
let b = v.wrapping_sub(0x3030_3030_3030_3030);
(a | b) & 0x8080_8080_8080_8080 == 0
}
#[inline]
fn parse_8digits(mut v: u64) -> u64 {
const MASK: u64 = 0x0000_00FF_0000_00FF;
const MUL1: u64 = 0x000F_4240_0000_0064;
const MUL2: u64 = 0x0000_2710_0000_0001;
v -= 0x3030_3030_3030_3030;
v = (v * 10) + (v >> 8); // will not overflow, fits in 63 bits
let v1 = (v & MASK).wrapping_mul(MUL1);
let v2 = ((v >> 16) & MASK).wrapping_mul(MUL2);
((v1.wrapping_add(v2) >> 32) as u32) as u64
}
#[inline]
fn try_parse_8digits(s: &mut AsciiStr<'_>, x: &mut u64) {
// may cause overflows, to be handled later
if let Some(v) = s.try_read_u64() {
if is_8digits(v) {
*x = x
.wrapping_mul(1_0000_0000)
.wrapping_add(parse_8digits(v));
s.step_by(8);
if let Some(v) = s.try_read_u64() {
if is_8digits(v) {
*x = x
.wrapping_mul(1_0000_0000)
.wrapping_add(parse_8digits(v));
s.step_by(8);
}
}
}
}
}
The only other endian-dependent path previously was this:
let v = s.read_u64();
if !is_8digits(v) {
break;
}
d.digits[d.num_digits..].write_u64(v - 0x3030_3030_3030_3030);
d.num_digits += 8;
s = s.advance(8);
Let's say we have b"12345678"
, then s.read_u64()
will produce 0x3837363534333231
since we read it in little-endian order on all architectures (using u64::from_le
). Then, write_u64
will write 0x0807060504030201
as [1, 2, 3, 4, 5, 6, 7, 8]
on all architectures, since we use u64::to_le
internally. That is, we get the same input and output in all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a simple program that proves the latter is true:
use std::ptr;
#[inline]
fn read_u64(bytes: &[u8]) -> u64 {
debug_assert!(bytes.len() >= 8);
let src = bytes.as_ptr() as *const u64;
u64::from_le(unsafe { ptr::read_unaligned(src) })
}
#[inline]
fn write_u64(bytes: &mut [u8], value: u64) {
debug_assert!(bytes.len() >= 8);
let dst = bytes.as_mut_ptr() as *mut u64;
unsafe { ptr::write_unaligned(dst, u64::to_le(value)) };
}
pub fn main() {
let input = b"12345678";
let value = read_u64(input);
assert_eq!(value, 0x3837363534333231);
let mut output = [0u8; 8];
write_u64(&mut output, value - 0x3030_3030_3030_3030);
assert_eq!(output, [1, 2, 3, 4, 5, 6, 7, 8]);
}
The former example should be much easier to conceptualize. This runs perfectly fine on both little-endian and big-endian systems.
I hope this makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty much what I reproduced in my head as well, thanks for making it crystal clear and putting it in writing!
Thanks, that's great! (and read/write functions are now quite neat and elegant) (These were the only places left from the original codebase with endian-dependent code) |
Technically, 1 more thing: in let v = s.read_u64();
if !is_8digits(v) {
break;
}
d.digits[d.num_digits..].write_u64(v - 0x3030_3030_3030_3030);
d.num_digits += 8;
s = s.advance(8); This is because the data isn't parsed, only read, converted from characters to digits, and then written. If this is done on big-endian systems without a byte-swap, we would get the following for let v = s.read_u64_ne(); // BE: 0x3132333435363738, LE: 0x3837363534333231
if !is_8digits(v) { // Always false, since we've just reversed the order of the characters in `v`.
break;
}
// Always writes [1, 2, 3, 4, 5, 6, 7, 8];
d.digits[d.num_digits..].write_u64_ne(v - 0x3030_3030_3030_3030);
d.num_digits += 8;
s = s.advance(8); In short, a fairly minor optimization could be |
So we're basically wasting two bswaps per 8 chars on BE on a slow path? Yea, can be probably ignored. All in all, looks good, let's merge this then. |
Fixes #26.
Rationale
This should produce the same byte-code, and remove all endian-dependent codepaths, given that the following are true:
u64::from_le
andu64::to_le
are no-ops on little-endian architectures.u64::from_le
andu64::to_le
are very cheap on big-endian architectures.ptr::read_unaligned
andptr::write_unaligned
are identical toptr::copy_nonoverlapping(src, dst, mem::size_of::<T>())
The first 2 are trivial to show that they're true:
For 3, we can see that read_unaligned is effectively identical to
ptr::copy_nonoverlapping(src, dst, mem::size_of::<T>())
, as long asMaybeUninit
compiles down to no instructions.Using the following source, we can see they're identical (on little-endian systems):
Compiled with
-C opt-level=3
, we can see the x86_64 assembly is identical.This also includes tests to ensure that both big-endian and little-endian systems read the bytes the same way.
Correctness Concerns
Should be non-existent, since as long as the value is read and written to the same native integer, then all the integer operations will produce the same result no matter the byte-order of the architecture. Tests using
b"01234567"
are included for bothread_u64
andwrite_u64
, which should confirm it produces the integer0x3736353433323130
. If we did not useto_le
andfrom_le
, we'd expect the opposite byte-order, or0x3031323334353637
(which would correspond to bytes ofb"76543210"
in little-endian). In short, we've confirmed we've gotten the proper result, and we've provided a significant optimization for big-endian architectures, and simplified a few functions.Alternatives
We could change all the masks and operations to check if the digits are correct to big-endian, however, this might require some additional effort to check correctness, and might require changes in many more locations. Since swapping the byte-order of an integer is effectively free in the grand scheme of things, this should be satisfactory.
Benchmarks
The benchmarks on big-endian are emulated via Qemu, and therefore should be taken with a grain of salt. However, the performance for little-endian systems is identical, and the (emulated) performance improves for big-endian systems.
Little-Endian (Native),
read_u64
Little-Endian (Native),
master
Big-Endian (powerpc-unknown-linux-gnu),
read_u64
Big-Endian (powerpc-unknown-linux-gnu),
master