Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added read_u64 optimizations to big endian. #27

Merged
merged 1 commit into from
May 24, 2021

Conversation

Alexhuszagh
Copy link
Contributor

@Alexhuszagh Alexhuszagh commented May 21, 2021

Fixes #26.

Rationale

This should produce the same byte-code, and remove all endian-dependent codepaths, given that the following are true:

  1. u64::from_le and u64::to_le are no-ops on little-endian architectures.
  2. u64::from_le and u64::to_le are very cheap on big-endian architectures.
  3. ptr::read_unaligned and ptr::write_unaligned are identical to ptr::copy_nonoverlapping(src, dst, mem::size_of::<T>())

The first 2 are trivial to show that they're true:

  • to_le and from_le are no-ops on little-endian, and cheap on big-endian.

For 3, we can see that read_unaligned is effectively identical to ptr::copy_nonoverlapping(src, dst, mem::size_of::<T>()), as long as MaybeUninit compiles down to no instructions.

Using the following source, we can see they're identical (on little-endian systems):

use std::ptr;

pub fn write_u64_v1(bytes: &mut [u8], value: u64) {
    let src = &value as *const _ as *const u8;
    let dst = bytes.as_mut_ptr();
    unsafe { ptr::copy_nonoverlapping(src, dst, 8) };
}

pub fn write_u64_v2(bytes: &mut [u8], value: u64) {
    let dst = bytes.as_mut_ptr() as *mut u64;
    unsafe { ptr::write_unaligned(dst, u64::to_le(value)) };
}

pub fn read_u64_v1(bytes: &[u8]) -> u64 {
    let mut value = 0_u64;
    let src = bytes.as_ptr();
    let dst = &mut value as *mut _ as *mut u8;
    unsafe { ptr::copy_nonoverlapping(src, dst, 8) };
    value
}

pub fn read_u64_v2(bytes: &[u8]) -> u64 {
    let src = bytes.as_ptr() as *const u64;
    u64::from_le(unsafe { ptr::read_unaligned(src) })
}

Compiled with -C opt-level=3, we can see the x86_64 assembly is identical.

example::read_u64_v1:
        mov     rax, qword ptr [rdi]
        ret

example::read_u64_v2:
        mov     rax, qword ptr [rdi]
        ret

example::write_u64_v1:
        mov     qword ptr [rdi], rdx
        ret

example::write_u64_v2:
        mov     qword ptr [rdi], rdx
        ret

This also includes tests to ensure that both big-endian and little-endian systems read the bytes the same way.

Correctness Concerns

Should be non-existent, since as long as the value is read and written to the same native integer, then all the integer operations will produce the same result no matter the byte-order of the architecture. Tests using b"01234567" are included for both read_u64 and write_u64, which should confirm it produces the integer 0x3736353433323130. If we did not use to_le and from_le, we'd expect the opposite byte-order, or 0x3031323334353637 (which would correspond to bytes of b"76543210" in little-endian). In short, we've confirmed we've gotten the proper result, and we've provided a significant optimization for big-endian architectures, and simplified a few functions.

Alternatives

We could change all the masks and operations to check if the digits are correct to big-endian, however, this might require some additional effort to check correctness, and might require changes in many more locations. Since swapping the byte-order of an integer is effectively free in the grand scheme of things, this should be satisfactory.

Benchmarks

The benchmarks on big-endian are emulated via Qemu, and therefore should be taken with a grain of salt. However, the performance for little-endian systems is identical, and the (emulated) performance improves for big-endian systems.

Little-Endian (Native), read_u64

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            27.13    27.29    27.66    28.09    28.56    29.90    44.77 |
| lexical               75.72    76.36    76.86    77.39    78.48    80.79   100.68 |
| from_str             200.21   200.92   201.65   202.70   204.25   209.90   314.91 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            22.34    33.51    35.02    35.61    36.15    36.65    36.86 |
| lexical                9.93    12.39    12.75    12.92    13.01    13.10    13.21 |
| from_str               3.18     4.77     4.90     4.93     4.96     4.98     4.99 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           388.73   583.04   609.39   619.59   629.11   637.69   641.38 |
| lexical              172.84   215.56   221.78   224.87   226.42   227.89   229.81 |
| from_str              55.26    82.92    85.20    85.85    86.29    86.61    86.92 |
|                                                                                   |
=====================================================================================

Little-Endian (Native), master

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            27.12    27.30    27.66    28.01    28.41    29.42    36.89 |
| lexical               75.76    75.98    76.48    76.98    77.95    81.02    96.75 |
| from_str             200.38   201.01   201.69   202.55   204.46   209.63   230.14 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            27.10    34.00    35.20    35.70    36.16    36.63    36.88 |
| lexical               10.34    12.35    12.83    12.99    13.08    13.16    13.20 |
| from_str               4.35     4.77     4.89     4.94     4.96     4.98     4.99 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           471.66   591.63   612.61   621.26   629.17   637.35   641.71 |
| lexical              179.86   214.90   223.25   226.07   227.53   229.01   229.69 |
| from_str              75.61    83.04    85.11    85.91    86.28    86.57    86.84 |
|                                                                                   |
=====================================================================================

Big-Endian (powerpc-unknown-linux-gnu), read_u64

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           239.00   240.00   240.81   241.40   242.22   245.25   270.75 |
| lexical              600.54   603.88   607.95   614.02   617.52   629.30   859.01 |
| from_str            1318.93  1325.09  1328.26  1331.44  1335.09  1349.77  1497.10 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float             3.69     4.08     4.13     4.14     4.15     4.17     4.18 |
| lexical                1.16     1.59     1.62     1.63     1.64     1.66     1.67 |
| from_str               0.67     0.74     0.75     0.75     0.75     0.75     0.76 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            64.27    70.96    71.84    72.09    72.26    72.51    72.81 |
| lexical               20.26    27.67    28.18    28.34    28.62    28.82    28.98 |
| from_str              11.62    12.89    13.03    13.07    13.10    13.13    13.19 |
|                                                                                   |
=====================================================================================

Big-Endian (powerpc-unknown-linux-gnu), master

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           259.11   261.30   262.17   262.88   263.73   267.88   302.34 |
| lexical              613.42   614.97   616.32   617.60   619.06   624.77   672.53 |
| from_str            1319.05  1328.78  1351.89  1357.88  1361.90  1374.38  1481.66 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float             3.31     3.73     3.79     3.80     3.81     3.83     3.86 |
| lexical                1.49     1.60     1.62     1.62     1.62     1.63     1.63 |
| from_str               0.67     0.73     0.73     0.74     0.74     0.75     0.76 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            57.56    64.96    65.98    66.20    66.38    66.60    67.16 |
| lexical               25.87    27.86    28.11    28.18    28.23    28.30    28.37 |
| from_str              11.74    12.66    12.78    12.82    12.87    13.10    13.19 |
|                                                                                   |
=====================================================================================

src/number.rs Outdated
if is_8digits(v) {
*x = x
.wrapping_mul(1_0000_0000)
.wrapping_add(parse_8digits_le(v));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'm just trying to wrap my head around to what exactly gets passed to parse_8digits_le() on big-endian arch and why does it actually work if it does...

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it does work indeed, just isn't immediately obvious. Maybe _le() suffix should be removed then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should, thats my mistake. Ill fix it and recommit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aldanor

Hm, I'm just trying to wrap my head around to what exactly gets passed to parse_8digits_le() on big-endian arch and why does it actually work if it does...

This works because of the following:

  1. The code all assumes the bytes are read in little-endian order. Say I have the string "12345678", in little-endian, I have now read this as 0x3837363534333231, or essentially, the string is reversed.
  2. The rest of the code is all pure numerical operations.

This is all the relevant code, besides try_readu64. There is only mathematical operations, that is, no matter what endian the architecture is, you get the same result as long as the integer passed to each function is the same. is_8digits(0x3837363534333231) will always produce the same result, as will parse_8digits(0x3837363534333231), since there is no endian-dependent code paths for either.

What matters, however, is that the bytes are read (or written) in little-endian order in all cases. So merely changing read_u64 should fix the issue in all cases.

#[inline]
pub fn is_8digits(v: u64) -> bool {
    let a = v.wrapping_add(0x4646_4646_4646_4646);
    let b = v.wrapping_sub(0x3030_3030_3030_3030);
    (a | b) & 0x8080_8080_8080_8080 == 0
}

#[inline]
fn parse_8digits(mut v: u64) -> u64 {
    const MASK: u64 = 0x0000_00FF_0000_00FF;
    const MUL1: u64 = 0x000F_4240_0000_0064;
    const MUL2: u64 = 0x0000_2710_0000_0001;
    v -= 0x3030_3030_3030_3030;
    v = (v * 10) + (v >> 8); // will not overflow, fits in 63 bits
    let v1 = (v & MASK).wrapping_mul(MUL1);
    let v2 = ((v >> 16) & MASK).wrapping_mul(MUL2);
    ((v1.wrapping_add(v2) >> 32) as u32) as u64
}

#[inline]
fn try_parse_8digits(s: &mut AsciiStr<'_>, x: &mut u64) {
    // may cause overflows, to be handled later
    if let Some(v) = s.try_read_u64() {
        if is_8digits(v) {
            *x = x
                .wrapping_mul(1_0000_0000)
                .wrapping_add(parse_8digits(v));
            s.step_by(8);
            if let Some(v) = s.try_read_u64() {
                if is_8digits(v) {
                    *x = x
                        .wrapping_mul(1_0000_0000)
                        .wrapping_add(parse_8digits(v));
                    s.step_by(8);
                }
            }
        }
    }
}

The only other endian-dependent path previously was this:

let v = s.read_u64();
if !is_8digits(v) {
    break;
}
d.digits[d.num_digits..].write_u64(v - 0x3030_3030_3030_3030);
d.num_digits += 8;
s = s.advance(8);

Let's say we have b"12345678", then s.read_u64() will produce 0x3837363534333231 since we read it in little-endian order on all architectures (using u64::from_le). Then, write_u64 will write 0x0807060504030201 as [1, 2, 3, 4, 5, 6, 7, 8] on all architectures, since we use u64::to_le internally. That is, we get the same input and output in all cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a simple program that proves the latter is true:

use std::ptr;

#[inline]
fn read_u64(bytes: &[u8]) -> u64 {
    debug_assert!(bytes.len() >= 8);
    let src = bytes.as_ptr() as *const u64;
    u64::from_le(unsafe { ptr::read_unaligned(src) })
}

#[inline]
fn write_u64(bytes: &mut [u8], value: u64) {
    debug_assert!(bytes.len() >= 8);
    let dst = bytes.as_mut_ptr() as *mut u64;
    unsafe { ptr::write_unaligned(dst, u64::to_le(value)) };
}

pub fn main() {
    let input = b"12345678";
    let value = read_u64(input);
    assert_eq!(value, 0x3837363534333231);
    let mut output = [0u8; 8];
    write_u64(&mut output, value - 0x3030_3030_3030_3030);
    assert_eq!(output, [1, 2, 3, 4, 5, 6, 7, 8]);
}

The former example should be much easier to conceptualize. This runs perfectly fine on both little-endian and big-endian systems.

I hope this makes sense?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty much what I reproduced in my head as well, thanks for making it crystal clear and putting it in writing!

@aldanor
Copy link
Owner

aldanor commented May 24, 2021

Thanks, that's great! (and read/write functions are now quite neat and elegant)

(These were the only places left from the original codebase with endian-dependent code)

@Alexhuszagh
Copy link
Contributor Author

Alexhuszagh commented May 24, 2021

Technically, 1 more thing: in decimal.rs, the following does not need to read to the native byteorder:

let v = s.read_u64();
if !is_8digits(v) {
    break;
}
d.digits[d.num_digits..].write_u64(v - 0x3030_3030_3030_3030);
d.num_digits += 8;
s = s.advance(8);

This is because the data isn't parsed, only read, converted from characters to digits, and then written. If this is done on big-endian systems without a byte-swap, we would get the following for b"12345678":

let v = s.read_u64_ne(); // BE: 0x3132333435363738, LE: 0x3837363534333231
if !is_8digits(v) {      // Always false, since we've just reversed the order of the characters in `v`.
    break;
}
// Always writes [1, 2, 3, 4, 5, 6, 7, 8];
d.digits[d.num_digits..].write_u64_ne(v - 0x3030_3030_3030_3030);
d.num_digits += 8;
s = s.advance(8);

In short, a fairly minor optimization could be read_u64_ne, read_u64_le, and write_u64_ne, where only read_u64_le would have a byte-swap. However, in practice, this at most omits a single, very fast instruction, on a code-path that's very rare and slow.

@aldanor
Copy link
Owner

aldanor commented May 24, 2021

So we're basically wasting two bswaps per 8 chars on BE on a slow path? Yea, can be probably ignored.

All in all, looks good, let's merge this then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove Endian-Dependent Code Paths
2 participants