Added read_u64 optimizations to big endian. #27

Alexhuszagh · 2021-05-21T22:22:04Z

Fixes #26.

Rationale

This should produce the same machine code on little-endian targets, and remove all endian-dependent code paths, provided the following are true:

1. u64::from_le and u64::to_le are no-ops on little-endian architectures.
2. u64::from_le and u64::to_le are very cheap on big-endian architectures.
3. ptr::read_unaligned and ptr::write_unaligned are identical to ptr::copy_nonoverlapping(src, dst, mem::size_of::<T>())

The first 2 are easy to verify: to_le and from_le are no-ops on little-endian, and only a byte-swap on big-endian.
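The two claims about to_le and from_le can be spot-checked against the explicit byte-array conversions in the standard library. This is only a sketch: the helper name `le_conversions_agree` is made up for illustration and is not part of the PR.

```rust
/// Hypothetical helper (not from the PR): checks that `to_le`/`from_le`
/// agree with the explicit byte-array conversions for one value.
pub fn le_conversions_agree(x: u64) -> bool {
    // `to_le` yields an integer whose native in-memory bytes are the
    // little-endian encoding of `x`.
    let to_ok = x.to_le().to_ne_bytes() == x.to_le_bytes();
    // `from_le` interprets the native bytes of its argument as little-endian.
    let from_ok = u64::from_le(x) == u64::from_le_bytes(x.to_ne_bytes());
    // On little-endian targets both conversions are the identity.
    let noop_ok =
        !cfg!(target_endian = "little") || (x.to_le() == x && u64::from_le(x) == x);
    to_ok && from_ok && noop_ok
}
```

The first two checks hold on any target; the third is only exercised when compiling for a little-endian target.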

For 3, we can see that read_unaligned is effectively identical to ptr::copy_nonoverlapping(src, dst, mem::size_of::<T>()), as long as MaybeUninit compiles down to no instructions.
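To make the third point concrete, here is a minimal sketch of an unaligned read written as a byte-wise copy into MaybeUninit storage. It is an illustration of the idea, not the standard library's source; `read_unaligned_sketch` and `unaligned_reads_agree` are hypothetical names.

```rust
use std::mem::{self, MaybeUninit};
use std::ptr;

/// Hypothetical re-implementation of an unaligned read as a byte-wise copy
/// into uninitialized storage. Not the standard library's source.
///
/// # Safety
/// `src` must be valid for reads of `size_of::<T>()` bytes.
pub unsafe fn read_unaligned_sketch<T: Copy>(src: *const T) -> T {
    let mut tmp = MaybeUninit::<T>::uninit();
    unsafe {
        // Copy exactly `size_of::<T>()` bytes; no alignment is required for `u8`.
        ptr::copy_nonoverlapping(
            src as *const u8,
            tmp.as_mut_ptr() as *mut u8,
            mem::size_of::<T>(),
        );
        // Every byte of `tmp` has been written, so it is initialized.
        tmp.assume_init()
    }
}

/// Reads a `u64` starting at offset 1 of a buffer both ways and compares.
pub fn unaligned_reads_agree() -> bool {
    let buf = *b"x01234567";
    // Offset 1 is generally not 8-byte aligned.
    let src = buf[1..].as_ptr() as *const u64;
    let a = unsafe { ptr::read_unaligned(src) };
    let b = unsafe { read_unaligned_sketch(src) };
    a == b
}
```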

Using the following source, we can see they're identical (on little-endian systems):

use std::ptr;

pub fn write_u64_v1(bytes: &mut [u8], value: u64) {
    let src = &value as *const _ as *const u8;
    let dst = bytes.as_mut_ptr();
    unsafe { ptr::copy_nonoverlapping(src, dst, 8) };
}

pub fn write_u64_v2(bytes: &mut [u8], value: u64) {
    let dst = bytes.as_mut_ptr() as *mut u64;
    unsafe { ptr::write_unaligned(dst, u64::to_le(value)) };
}

pub fn read_u64_v1(bytes: &[u8]) -> u64 {
    let mut value = 0_u64;
    let src = bytes.as_ptr();
    let dst = &mut value as *mut _ as *mut u8;
    unsafe { ptr::copy_nonoverlapping(src, dst, 8) };
    value
}

pub fn read_u64_v2(bytes: &[u8]) -> u64 {
    let src = bytes.as_ptr() as *const u64;
    u64::from_le(unsafe { ptr::read_unaligned(src) })
}

Compiled with -C opt-level=3, the x86_64 assembly for each pair is identical:

example::read_u64_v1:
        mov     rax, qword ptr [rdi]
        ret

example::read_u64_v2:
        mov     rax, qword ptr [rdi]
        ret

example::write_u64_v1:
        mov     qword ptr [rdi], rdx
        ret

example::write_u64_v2:
        mov     qword ptr [rdi], rdx
        ret

This also includes tests to ensure that both big-endian and little-endian systems read the bytes the same way.

Correctness Concerns

These should be non-existent: as long as the value is read into and written from the same native integer, all the integer operations produce the same result no matter the byte order of the architecture. Tests using b"01234567" are included for both read_u64 and write_u64, which confirm that these bytes correspond to the integer 0x3736353433323130. If we did not use to_le and from_le, we'd expect the opposite byte order on big-endian systems, or 0x3031323334353637 (which would correspond to bytes of b"76543210" in little-endian). In short, we've confirmed we get the proper result, we've provided a significant optimization for big-endian architectures, and we've simplified a few functions.
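The check described above can be sketched as follows. The two functions have the same shape as read_u64_v2 and write_u64_v2 from the Rationale, with a length assertion added; `check_read_write` is a hypothetical name and is not the test that ships with the PR.

```rust
use std::ptr;

/// Reads 8 bytes as a little-endian `u64` on every architecture.
pub fn read_u64(bytes: &[u8]) -> u64 {
    assert!(bytes.len() >= 8);
    let src = bytes.as_ptr() as *const u64;
    u64::from_le(unsafe { ptr::read_unaligned(src) })
}

/// Writes a `u64` as 8 little-endian bytes on every architecture.
pub fn write_u64(bytes: &mut [u8], value: u64) {
    assert!(bytes.len() >= 8);
    let dst = bytes.as_mut_ptr() as *mut u64;
    unsafe { ptr::write_unaligned(dst, u64::to_le(value)) };
}

/// Hypothetical check mirroring the b"01234567" tests described above.
pub fn check_read_write() {
    let bytes = b"01234567";
    // '0' (0x30) lands in the least significant byte.
    assert_eq!(read_u64(bytes), 0x3736_3534_3332_3130);

    let mut out = [0u8; 8];
    write_u64(&mut out, 0x3736_3534_3332_3130);
    assert_eq!(&out, bytes);

    // Interpreting the same bytes as big-endian gives the reversed value,
    // which is what a native read on a big-endian system would return.
    assert_eq!(u64::from_be_bytes(*bytes), 0x3031_3233_3435_3637);
}
```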

Alternatives

We could instead change all the masks and operations so that they check the digits in big-endian order; however, this would require additional effort to verify correctness, and might require changes in many more locations. Since swapping the byte order of an integer is effectively free in the grand scheme of things, this should be satisfactory.

Benchmarks

The benchmarks on big-endian are emulated via Qemu, and therefore should be taken with a grain of salt. However, the performance for little-endian systems is identical, and the (emulated) performance improves for big-endian systems.

Little-Endian (Native), `read_u64`

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            27.13    27.29    27.66    28.09    28.56    29.90    44.77 |
| lexical               75.72    76.36    76.86    77.39    78.48    80.79   100.68 |
| from_str             200.21   200.92   201.65   202.70   204.25   209.90   314.91 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            22.34    33.51    35.02    35.61    36.15    36.65    36.86 |
| lexical                9.93    12.39    12.75    12.92    13.01    13.10    13.21 |
| from_str               3.18     4.77     4.90     4.93     4.96     4.98     4.99 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           388.73   583.04   609.39   619.59   629.11   637.69   641.38 |
| lexical              172.84   215.56   221.78   224.87   226.42   227.89   229.81 |
| from_str              55.26    82.92    85.20    85.85    86.29    86.61    86.92 |
|                                                                                   |
=====================================================================================

Little-Endian (Native), `master`

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            27.12    27.30    27.66    28.01    28.41    29.42    36.89 |
| lexical               75.76    75.98    76.48    76.98    77.95    81.02    96.75 |
| from_str             200.38   201.01   201.69   202.55   204.46   209.63   230.14 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            27.10    34.00    35.20    35.70    36.16    36.63    36.88 |
| lexical               10.34    12.35    12.83    12.99    13.08    13.16    13.20 |
| from_str               4.35     4.77     4.89     4.94     4.96     4.98     4.99 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           471.66   591.63   612.61   621.26   629.17   637.35   641.71 |
| lexical              179.86   214.90   223.25   226.07   227.53   229.01   229.69 |
| from_str              75.61    83.04    85.11    85.91    86.28    86.57    86.84 |
|                                                                                   |
=====================================================================================

Big-Endian (powerpc-unknown-linux-gnu), `read_u64`

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           239.00   240.00   240.81   241.40   242.22   245.25   270.75 |
| lexical              600.54   603.88   607.95   614.02   617.52   629.30   859.01 |
| from_str            1318.93  1325.09  1328.26  1331.44  1335.09  1349.77  1497.10 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float             3.69     4.08     4.13     4.14     4.15     4.17     4.18 |
| lexical                1.16     1.59     1.62     1.63     1.64     1.66     1.67 |
| from_str               0.67     0.74     0.75     0.75     0.75     0.75     0.76 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            64.27    70.96    71.84    72.09    72.26    72.51    72.81 |
| lexical               20.26    27.67    28.18    28.34    28.62    28.82    28.98 |
| from_str              11.62    12.89    13.03    13.07    13.10    13.13    13.19 |
|                                                                                   |
=====================================================================================

Big-Endian (powerpc-unknown-linux-gnu), `master`

=====================================================================================
|                         canada.txt (111126, 1.93 MB, f64)                         |
|===================================================================================|
|                                                                                   |
| ns/float                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float           259.11   261.30   262.17   262.88   263.73   267.88   302.34 |
| lexical              613.42   614.97   616.32   617.60   619.06   624.77   672.53 |
| from_str            1319.05  1328.78  1351.89  1357.88  1361.90  1374.38  1481.66 |
|                                                                                   |
| Mfloat/s                min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float             3.31     3.73     3.79     3.80     3.81     3.83     3.86 |
| lexical                1.49     1.60     1.62     1.62     1.62     1.63     1.63 |
| from_str               0.67     0.73     0.73     0.74     0.74     0.75     0.76 |
|                                                                                   |
| MB/s                    min       5%      25%   median      75%      95%      max |
|-----------------------------------------------------------------------------------|
| fast-float            57.56    64.96    65.98    66.20    66.38    66.60    67.16 |
| lexical               25.87    27.86    28.11    28.18    28.23    28.30    28.37 |
| from_str              11.74    12.66    12.78    12.82    12.87    13.10    13.19 |
|                                                                                   |
=====================================================================================
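For a rough sense of scale, the median fast-float timings in the two big-endian tables can be compared directly. The sketch below only divides two numbers copied from the tables; the constant and function names are made up for illustration, and the caveat about emulation under Qemu still applies.

```rust
/// Median fast-float timings (ns/float) on the emulated big-endian target,
/// copied from the `master` and `read_u64` tables above.
pub const BE_MASTER_MEDIAN_NS: f64 = 262.88;
pub const BE_READ_U64_MEDIAN_NS: f64 = 241.40;

/// Relative speedup of `new_ns` over `old_ns`; both are times, so lower is
/// better and a result above 1.0 means the new code is faster.
pub fn speedup(old_ns: f64, new_ns: f64) -> f64 {
    old_ns / new_ns
}
```

With these inputs the ratio comes out a little under 1.09, that is, the emulated big-endian median is between 8 and 9 percent faster on the read_u64 branch.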

aldanor · 2021-05-23T23:09:42Z

src/number.rs

+        if is_8digits(v) {
+            *x = x
+                .wrapping_mul(1_0000_0000)
+                .wrapping_add(parse_8digits_le(v));


Hm, I'm just trying to wrap my head around to what exactly gets passed to parse_8digits_le() on big-endian arch and why does it actually work if it does...

I guess it does work indeed, just isn't immediately obvious. Maybe _le() suffix should be removed then?

Yes it should, that's my mistake. I'll fix it and recommit.

@aldanor

Hm, I'm just trying to wrap my head around to what exactly gets passed to parse_8digits_le() on big-endian arch and why does it actually work if it does...

This works because of the following:

The code all assumes the bytes are read in little-endian order. Say I have the string "12345678", in little-endian, I have now read this as 0x3837363534333231, or essentially, the string is reversed.

The rest of the code is all pure numerical operations.

This is all the relevant code, besides try_read_u64. There are only mathematical operations; that is, no matter the endianness of the architecture, you get the same result as long as the integer passed to each function is the same. is_8digits(0x3837363534333231) will always produce the same result, as will parse_8digits(0x3837363534333231), since there are no endian-dependent code paths in either.

What matters, however, is that the bytes are read (or written) in little-endian order in all cases. So merely changing read_u64 should fix the issue in all cases.

#[inline]
pub fn is_8digits(v: u64) -> bool {
    let a = v.wrapping_add(0x4646_4646_4646_4646);
    let b = v.wrapping_sub(0x3030_3030_3030_3030);
    (a | b) & 0x8080_8080_8080_8080 == 0
}

#[inline]
fn parse_8digits(mut v: u64) -> u64 {
    const MASK: u64 = 0x0000_00FF_0000_00FF;
    const MUL1: u64 = 0x000F_4240_0000_0064;
    const MUL2: u64 = 0x0000_2710_0000_0001;
    v -= 0x3030_3030_3030_3030;
    v = (v * 10) + (v >> 8); // will not overflow, fits in 63 bits
    let v1 = (v & MASK).wrapping_mul(MUL1);
    let v2 = ((v >> 16) & MASK).wrapping_mul(MUL2);
    ((v1.wrapping_add(v2) >> 32) as u32) as u64
}

#[inline]
fn try_parse_8digits(s: &mut AsciiStr<'_>, x: &mut u64) {
    // may cause overflows, to be handled later
    if let Some(v) = s.try_read_u64() {
        if is_8digits(v) {
            *x = x
                .wrapping_mul(1_0000_0000)
                .wrapping_add(parse_8digits(v));
            s.step_by(8);
            if let Some(v) = s.try_read_u64() {
                if is_8digits(v) {
                    *x = x
                        .wrapping_mul(1_0000_0000)
                        .wrapping_add(parse_8digits(v));
                    s.step_by(8);
                }
            }
        }
    }
}
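As a sanity check of the claim that these functions depend only on the integer they receive, the sketch below builds that integer with u64::from_le_bytes, which gives the same value on every architecture, and feeds it through is_8digits and parse_8digits. The two functions are copied from the comment above so the snippet is self-contained; `parse_le_bytes` is a hypothetical wrapper, not part of the crate.

```rust
/// Copied from the comment above: true if all 8 bytes are ASCII digits.
#[inline]
pub fn is_8digits(v: u64) -> bool {
    let a = v.wrapping_add(0x4646_4646_4646_4646);
    let b = v.wrapping_sub(0x3030_3030_3030_3030);
    (a | b) & 0x8080_8080_8080_8080 == 0
}

/// Copied from the comment above: converts 8 ASCII digits, packed with the
/// first character in the least significant byte, to their numeric value.
#[inline]
pub fn parse_8digits(mut v: u64) -> u64 {
    const MASK: u64 = 0x0000_00FF_0000_00FF;
    const MUL1: u64 = 0x000F_4240_0000_0064;
    const MUL2: u64 = 0x0000_2710_0000_0001;
    v -= 0x3030_3030_3030_3030;
    v = (v * 10) + (v >> 8); // will not overflow, fits in 63 bits
    let v1 = (v & MASK).wrapping_mul(MUL1);
    let v2 = ((v >> 16) & MASK).wrapping_mul(MUL2);
    ((v1.wrapping_add(v2) >> 32) as u32) as u64
}

/// Hypothetical wrapper: parses 8 ASCII digits from a byte array.
pub fn parse_le_bytes(bytes: [u8; 8]) -> Option<u64> {
    // Same integer on little-endian and big-endian targets.
    let v = u64::from_le_bytes(bytes);
    if is_8digits(v) {
        Some(parse_8digits(v))
    } else {
        None
    }
}
```

For b"12345678" the intermediate integer is 0x3837363534333231 and the parsed value is 12345678, independent of the host byte order.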

The only other endian-dependent path previously was this:

let v = s.read_u64();
if !is_8digits(v) {
    break;
}
d.digits[d.num_digits..].write_u64(v - 0x3030_3030_3030_3030);
d.num_digits += 8;
s = s.advance(8);

Let's say we have b"12345678", then s.read_u64() will produce 0x3837363534333231 since we read it in little-endian order on all architectures (using u64::from_le). Then, write_u64 will write 0x0807060504030201 as [1, 2, 3, 4, 5, 6, 7, 8] on all architectures, since we use u64::to_le internally. That is, we get the same input and output in all cases.

Here's a simple program that proves the latter is true:

use std::ptr;

#[inline]
fn read_u64(bytes: &[u8]) -> u64 {
    debug_assert!(bytes.len() >= 8);
    let src = bytes.as_ptr() as *const u64;
    u64::from_le(unsafe { ptr::read_unaligned(src) })
}

#[inline]
fn write_u64(bytes: &mut [u8], value: u64) {
    debug_assert!(bytes.len() >= 8);
    let dst = bytes.as_mut_ptr() as *mut u64;
    unsafe { ptr::write_unaligned(dst, u64::to_le(value)) };
}

pub fn main() {
    let input = b"12345678";
    let value = read_u64(input);
    assert_eq!(value, 0x3837363534333231);

    let mut output = [0u8; 8];
    write_u64(&mut output, value - 0x3030_3030_3030_3030);
    assert_eq!(output, [1, 2, 3, 4, 5, 6, 7, 8]);
}

The former example should be much easier to conceptualize. This runs perfectly fine on both little-endian and big-endian systems.

I hope this makes sense?

This is pretty much what I reproduced in my head as well, thanks for making it crystal clear and putting it in writing!

aldanor · 2021-05-24T00:08:53Z

Thanks, that's great! (and read/write functions are now quite neat and elegant)

(These were the only places left from the original codebase with endian-dependent code)

Alexhuszagh · 2021-05-24T03:15:49Z

Technically, 1 more thing: in decimal.rs, the following does not need a byte-swap at all, and could read and write in the native byte order:

let v = s.read_u64();
if !is_8digits(v) {
    break;
}
d.digits[d.num_digits..].write_u64(v - 0x3030_3030_3030_3030);
d.num_digits += 8;
s = s.advance(8);

This is because the data isn't parsed, only read, converted from characters to digits, and then written. If this is done on big-endian systems without a byte-swap, we would get the following for b"12345678":

let v = s.read_u64_ne(); // BE: 0x3132333435363738, LE: 0x3837363534333231
if !is_8digits(v) {      // False on both BE and LE here: is_8digits checks each byte, so the order of the characters in `v` does not matter.
    break;
}
// Always writes [1, 2, 3, 4, 5, 6, 7, 8];
d.digits[d.num_digits..].write_u64_ne(v - 0x3030_3030_3030_3030);
d.num_digits += 8;
s = s.advance(8);

In short, a fairly minor optimization could be read_u64_ne, read_u64_le, and write_u64_ne, where only read_u64_le would have a byte-swap. However, in practice, this at most omits a single, very fast instruction, on a code-path that's very rare and slow.
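A minimal sketch of what that three-function split could look like, assuming free functions over byte slices rather than the crate's actual traits. The names read_u64_ne, read_u64_le, and write_u64_ne come from the comment above; `copy_8digits` is a hypothetical stand-in for the decimal.rs step, and it omits the is_8digits check for brevity.

```rust
use std::ptr;

/// Native-endian read: no byte-swap on any architecture.
pub fn read_u64_ne(bytes: &[u8]) -> u64 {
    assert!(bytes.len() >= 8);
    unsafe { ptr::read_unaligned(bytes.as_ptr() as *const u64) }
}

/// Little-endian read: the only one of the three that byte-swaps,
/// and only on big-endian targets.
pub fn read_u64_le(bytes: &[u8]) -> u64 {
    u64::from_le(read_u64_ne(bytes))
}

/// Native-endian write: no byte-swap on any architecture.
pub fn write_u64_ne(bytes: &mut [u8], value: u64) {
    assert!(bytes.len() >= 8);
    unsafe { ptr::write_unaligned(bytes.as_mut_ptr() as *mut u64, value) };
}

/// Hypothetical digit-copy step: converts 8 ASCII digits in `src` to the
/// values 0..=9 in `dst`. Each byte is handled independently, so the byte
/// order of the intermediate integer does not affect the bytes written.
pub fn copy_8digits(src: &[u8], dst: &mut [u8]) {
    let v = read_u64_ne(src);
    write_u64_ne(dst, v.wrapping_sub(0x3030_3030_3030_3030));
}
```

Parsing code would keep using read_u64_le, since parse_8digits relies on the first character sitting in the least significant byte.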

aldanor · 2021-05-24T21:22:25Z

So we're basically wasting two bswaps per 8 chars on BE on a slow path? Yea, can be probably ignored.

All in all, looks good, let's merge this then.

aldanor reviewed May 23, 2021

Added read_u64 optimizations to big endian.

bc9ed2e

Alexhuszagh force-pushed the read_u64 branch from b2c074d to bc9ed2e Compare May 24, 2021 01:40

aldanor merged commit ec1b7d4 into aldanor:master May 24, 2021

ydongyeon mentioned this pull request Oct 13, 2024

Null Pointer Dereference Vulnerability in AsciiStr Struct #38

Closed
