Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a known-patchable sequence for rdtsc trapping #3580

Merged
merged 1 commit into from
Aug 19, 2023

Conversation

Keno
Copy link
Member

@Keno Keno commented Aug 18, 2023

I was looking some disassembly that looked like this:

   0x00007f5d7d0e7ffd <+29>:	48 89 7c 24 10	mov    %rdi,0x10(%rsp)
   0x00007f5d7d0e8002 <+34>:	4c 8d 64 24 20	lea    0x20(%rsp),%r12
   0x00007f5d7d0e8007 <+39>:	eb 37	jmp    0x7f5d7d0e8040 <_ZN5tracy8Profiler14CalibrateDelayEv+96>
   0x00007f5d7d0e8009 <+41>:	0f 1f 80 00 00 00 00	nopl   0x0(%rax)
   0x00007f5d7d0e8010 <+48>:	e9 03 8b 1a 3c	jmpq   0x7f5db9290b18
   0x00007f5d7d0e8015 <+53>:	90	nop
   0x00007f5d7d0e8016 <+54>:	4c 8d 2c 02	lea    (%rdx,%rax,1),%r13
   0x00007f5d7d0e801a <+58>:	e8 31 60 fc ff	callq  0x7f5d7d0ae050 <_ZN5tracy28HardwareSupportsInvariantTSCEv@plt>
   0x00007f5d7d0e801f <+63>:	84 c0	test   %al,%al
   0x00007f5d7d0e8021 <+65>:	74 4a	je     0x7f5d7d0e806d <_ZN5tracy8Profiler14CalibrateDelayEv+141>
=> 0x00007f5d7d0e8023 <+67>:	0f 31	rdtsc
   0x00007f5d7d0e8025 <+69>:	48 c1 e2 20	shl    $0x20,%rdx

We do have a syscallbuf template for shl $0x20,%rdx. Irritatingly, however, the 7c 24 at the top trigger's rr's interfering branch heuristic by false-positive coincidence. Even more unfortunately, this is a calibration loop for a tracing library, so it does actually just call rdtsc in a loop a bunch. Additionally this is shipped as a binary, so every user is hitting this unfortunate coincidence. Of course, since we're shipping the library, I can suggest we modify the source to make it more friendly to rr patching. That said, I wasn't quite sure what would be best to suggest here.

At first I thought, I could simply insert a bunch of nops after. Unfortunately, that doesn't actually do anything, because the branch could still be interfering into the middle of the nop sled.
Then I thought maybe a singular large nop would be sufficient, but of course, there could be something that intentionally conditionally skips the rdtsc.

In this PR, I propose that we make the following sequence known-safe:

nopl 0(%ax, %ax, 1) # single instruction, 5-byte nop
rdtsc

This currently wouldn't quite work, because an interfering jump to the rdtsc would simply hit the trailing nop padding, ignoring the rdtsc. However, if we slightly tweak the patch to instead use:

1: jmp %hook # returns at 2f as usual
[usual nop padding here]
jmp 1b
2:

Then everything works out well. Our return address is past the entire patch region, so we return to the correct place and it doesn't matter whether we jump to the nop or to the instruction itself. Of course, and actual interfering branch over the nop is unlikely (though I supposed not impossible depending on what comes before), since interfering branches are supported, this nicely takes care of the spurious branch problem above (assuming the extra 5-byte nop is inserted).

@Keno
Copy link
Member Author

Keno commented Aug 18, 2023

Note that the base branch of this is #3579, so that should be merged first, at which point GitHub should automatically retarget this branch.

@khuey
Copy link
Collaborator

khuey commented Aug 18, 2023

Rather than deal with this complexity, maybe we should choose a fixed non-canonical 5 byte NOP (the 5 byte NOP sequence has two bytes that are normally just 0 but could be anything we want) and use that to represent "I promise there are no interfering branches"?

@Keno
Copy link
Member Author

Keno commented Aug 18, 2023

I think we could do that, but I'm not sure it's any better. The implementation complexity for implementing that would be about the same and I don't really see any benefit over just making it work as in this PR. I think if we can avoid magic extra semantics, we should just do that.

@Keno Keno force-pushed the kf/interferingrdtsc branch from 9363924 to 22896db Compare August 19, 2023 00:45
I was looking some disassembly that looked like this:
```
   0x00007f5d7d0e7ffd <+29>:	48 89 7c 24 10	mov    %rdi,0x10(%rsp)
   0x00007f5d7d0e8002 <+34>:	4c 8d 64 24 20	lea    0x20(%rsp),%r12
   0x00007f5d7d0e8007 <+39>:	eb 37	jmp    0x7f5d7d0e8040 <_ZN5tracy8Profiler14CalibrateDelayEv+96>
   0x00007f5d7d0e8009 <+41>:	0f 1f 80 00 00 00 00	nopl   0x0(%rax)
   0x00007f5d7d0e8010 <+48>:	e9 03 8b 1a 3c	jmpq   0x7f5db9290b18
   0x00007f5d7d0e8015 <+53>:	90	nop
   0x00007f5d7d0e8016 <+54>:	4c 8d 2c 02	lea    (%rdx,%rax,1),%r13
   0x00007f5d7d0e801a <+58>:	e8 31 60 fc ff	callq  0x7f5d7d0ae050 <_ZN5tracy28HardwareSupportsInvariantTSCEv@plt>
   0x00007f5d7d0e801f <+63>:	84 c0	test   %al,%al
   0x00007f5d7d0e8021 <+65>:	74 4a	je     0x7f5d7d0e806d <_ZN5tracy8Profiler14CalibrateDelayEv+141>
=> 0x00007f5d7d0e8023 <+67>:	0f 31	rdtsc
   0x00007f5d7d0e8025 <+69>:	48 c1 e2 20	shl    $0x20,%rdx
```
We do have a syscallbuf template for `shl $0x20,%rdx`.
Irritatingly, however, the `7c 24` at the top trigger's rr's interfering branch heuristic
by false-positive coincidence. Even more unfortunately, this is a calibration loop
for a tracing library, so it does actually just call `rdtsc` in a loop a bunch.
Additionally this is shipped as a binary, so every user is hitting this unfortunate coincidence.
Of course, since we're shipping the library, I can suggest we modify the source to make
it more friendly to rr patching. That said, I wasn't quite sure what would be best to suggest here.

At first I thought, I could simply insert a bunch of nops after.
Unfortunately, that doesn't actually do anything, because the branch could still
be interfering into the middle of the nop sled.
Then I thought maybe a singular large `nop` would be sufficient, but of course,
there could be something that intentionally conditionally skips the rdtsc.

In this PR, I propose that we make the following sequence known-safe:
```
nopl 0(%ax, %ax, 1) # single instruction, 5-byte nop
rdtsc
```

This currently wouldn't quite work, because an interfering jump to the
`rdtsc` would simply hit the trailing nop padding, ignoring the rdtsc.
However, if we slightly tweak the patch to instead use:

```
1: jmp %hook
[usual nop padding here]
jmp 1b
2:
```

Then everything works out well. Our return address is past the entire patch
region, so we return to the correct place and it doesn't matter whether
we jump to the nop or to the instruction itself. Of course, and actual
interfering branch over the nop is unlikely (though I supposed not
impossible depending on what comes before), since interfering branches
are supported, this nicely takes care of the spurious branch problem
above (assuming the extra 5-byte nop is inserted).
@Keno Keno force-pushed the kf/saferdtscseq branch from 2a49b30 to e1818b3 Compare August 19, 2023 00:50
@Keno Keno changed the base branch from kf/interferingrdtsc to master August 19, 2023 00:50
@rocallahan rocallahan merged commit cc5cee0 into master Aug 19, 2023
Keno added a commit to JuliaPackaging/Yggdrasil that referenced this pull request Aug 19, 2023
Several users have reported masssive slowdowns to rr record/replay
in packages that have the tracy client loaded. The issue turns out
to be an unlucky false positive in an rr heuristic that determines
the patchability of `rdtsc`. In rr master, it is possible to guarantee
patchability by using a specific `nopl; rdtsc` sequence. Add a patch
to tracy to do just that.

See also
rr-debugger/rr#3580
JuliaLang/julia#50975
Keno added a commit to JuliaPackaging/Yggdrasil that referenced this pull request Aug 19, 2023
Several users have reported masssive slowdowns to rr record/replay
in packages that have the tracy client loaded. The issue turns out
to be an unlucky false positive in an rr heuristic that determines
the patchability of `rdtsc`. In rr master, it is possible to guarantee
patchability by using a specific `nopl; rdtsc` sequence. Add a patch
to tracy to do just that.

See also
rr-debugger/rr#3580
JuliaLang/julia#50975
Keno added a commit to JuliaPackaging/Yggdrasil that referenced this pull request Aug 19, 2023
Several users have reported masssive slowdowns to rr record/replay
in packages that have the tracy client loaded. The issue turns out
to be an unlucky false positive in an rr heuristic that determines
the patchability of `rdtsc`. In rr master, it is possible to guarantee
patchability by using a specific `nopl; rdtsc` sequence. Add a patch
to tracy to do just that.

See also
rr-debugger/rr#3580
JuliaLang/julia#50975
Keno added a commit to JuliaPackaging/Yggdrasil that referenced this pull request Aug 19, 2023
Several users have reported masssive slowdowns to rr record/replay
in packages that have the tracy client loaded. The issue turns out
to be an unlucky false positive in an rr heuristic that determines
the patchability of `rdtsc`. In rr master, it is possible to guarantee
patchability by using a specific `nopl; rdtsc` sequence. Add a patch
to tracy to do just that.

See also
rr-debugger/rr#3580
JuliaLang/julia#50975
@Keno
Copy link
Member Author

Keno commented Aug 19, 2023

Looks like the new test if failing on CI after my rebase onto the merged version of the other branch. Not sure what I did there, but I guess it's possible I typo'd something. Unfortunately, I'm on my way out of the office, but I'll take a look tomorrow.

@Keno
Copy link
Member Author

Keno commented Aug 19, 2023

Never mind, it's a simple bug where one of the conditions snuck to the wrong level of nesting - I'll just push a fix.

Keno added a commit to JuliaPackaging/Yggdrasil that referenced this pull request Aug 19, 2023
Several users have reported masssive slowdowns to rr record/replay
in packages that have the tracy client loaded. The issue turns out
to be an unlucky false positive in an rr heuristic that determines
the patchability of `rdtsc`. In rr master, it is possible to guarantee
patchability by using a specific `nopl; rdtsc` sequence. Add a patch
to tracy to do just that.

See also
rr-debugger/rr#3580
JuliaLang/julia#50975
Keno added a commit to JuliaPackaging/Yggdrasil that referenced this pull request Aug 20, 2023
Several users have reported masssive slowdowns to rr record/replay
in packages that have the tracy client loaded. The issue turns out
to be an unlucky false positive in an rr heuristic that determines
the patchability of `rdtsc`. In rr master, it is possible to guarantee
patchability by using a specific `nopl; rdtsc` sequence. Add a patch
to tracy to do just that.

See also
rr-debugger/rr#3580
JuliaLang/julia#50975
Keno added a commit to Keno/tracy that referenced this pull request Aug 20, 2023
We (Julia) ship both support for using tracy to trace julia applications,
as well as using `rr` (https://github.com/rr-debugger/rr) for record-replay debugging.
After our most recent rebuild of tracy, users have been reporting signfificant performance
slowdowns when `rr` recording a session that happens to also load the tracy library
(even if tracing is not enabled). Upon further examination, the recompile happened
to trigger a protective heuristic that disabled rr's patching of tracy's use of
`rdtsc` because an earlier part of the same function happened to look like a
conditional branch into the patch region. See rr-debugger/rr#3580
for details. To avoid this issue occurring again in future rebuilds of tracy,
adjust tracy's `rdtsc` sequence to be `nopl; rdtsc`, which (as of of the
linked PR) is a sequence that is guaranteed to bypass this heuristic
and not incur the additional overhead when run under rr.
topolarity pushed a commit to Keno/tracy that referenced this pull request Sep 21, 2023
We (Julia) ship both support for using tracy to trace julia applications,
as well as using `rr` (https://github.com/rr-debugger/rr) for record-replay debugging.
After our most recent rebuild of tracy, users have been reporting signfificant performance
slowdowns when `rr` recording a session that happens to also load the tracy library
(even if tracing is not enabled). Upon further examination, the recompile happened
to trigger a protective heuristic that disabled rr's patching of tracy's use of
`rdtsc` because an earlier part of the same function happened to look like a
conditional branch into the patch region. See rr-debugger/rr#3580
for details. To avoid this issue occurring again in future rebuilds of tracy,
adjust tracy's `rdtsc` sequence to be `nopl; rdtsc`, which (as of of the
linked PR) is a sequence that is guaranteed to bypass this heuristic
and not incur the additional overhead when run under rr.

A new compile-time flag `TRACY_NO_EXTRA_NOPSLEDS` is available to
disable this expanded `rdstsc` sequence if code size is a concern.
topolarity pushed a commit to Keno/tracy that referenced this pull request Sep 21, 2023
We (Julia) ship both support for using tracy to trace julia applications,
as well as using `rr` (https://github.com/rr-debugger/rr) for record-replay debugging.
After our most recent rebuild of tracy, users have been reporting signfificant performance
slowdowns when `rr` recording a session that happens to also load the tracy library
(even if tracing is not enabled). Upon further examination, the recompile happened
to trigger a protective heuristic that disabled rr's patching of tracy's use of
`rdtsc` because an earlier part of the same function happened to look like a
conditional branch into the patch region. See rr-debugger/rr#3580
for details. To avoid this issue occurring again in future rebuilds of tracy,
adjust tracy's `rdtsc` sequence to be `nopl; rdtsc`, which (as of of the
linked PR) is a sequence that is guaranteed to bypass this heuristic
and not incur the additional overhead when run under rr.

This functionality is kept behind a compile-time flag `TRACY_PATCHABLE_NOPSLEDS`
in order to avoid polluting the instruction cache unnecessarily.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants