-
Notifications
You must be signed in to change notification settings - Fork 600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create a known-patchable sequence for rdtsc trapping #3580
Conversation
Note that the base branch of this is #3579, so that should be merged first, at which point GitHub should automatically retarget this branch. |
Rather than deal with this complexity, maybe we should choose a fixed non-canonical 5 byte NOP (the 5 byte NOP sequence has two bytes that are normally just 0 but could be anything we want) and use that to represent "I promise there are no interfering branches"? |
I think we could do that, but I'm not sure it's any better. The implementation complexity for implementing that would be about the same and I don't really see any benefit over just making it work as in this PR. I think if we can avoid magic extra semantics, we should just do that. |
9363924
to
22896db
Compare
I was looking some disassembly that looked like this: ``` 0x00007f5d7d0e7ffd <+29>: 48 89 7c 24 10 mov %rdi,0x10(%rsp) 0x00007f5d7d0e8002 <+34>: 4c 8d 64 24 20 lea 0x20(%rsp),%r12 0x00007f5d7d0e8007 <+39>: eb 37 jmp 0x7f5d7d0e8040 <_ZN5tracy8Profiler14CalibrateDelayEv+96> 0x00007f5d7d0e8009 <+41>: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 0x00007f5d7d0e8010 <+48>: e9 03 8b 1a 3c jmpq 0x7f5db9290b18 0x00007f5d7d0e8015 <+53>: 90 nop 0x00007f5d7d0e8016 <+54>: 4c 8d 2c 02 lea (%rdx,%rax,1),%r13 0x00007f5d7d0e801a <+58>: e8 31 60 fc ff callq 0x7f5d7d0ae050 <_ZN5tracy28HardwareSupportsInvariantTSCEv@plt> 0x00007f5d7d0e801f <+63>: 84 c0 test %al,%al 0x00007f5d7d0e8021 <+65>: 74 4a je 0x7f5d7d0e806d <_ZN5tracy8Profiler14CalibrateDelayEv+141> => 0x00007f5d7d0e8023 <+67>: 0f 31 rdtsc 0x00007f5d7d0e8025 <+69>: 48 c1 e2 20 shl $0x20,%rdx ``` We do have a syscallbuf template for `shl $0x20,%rdx`. Irritatingly, however, the `7c 24` at the top trigger's rr's interfering branch heuristic by false-positive coincidence. Even more unfortunately, this is a calibration loop for a tracing library, so it does actually just call `rdtsc` in a loop a bunch. Additionally this is shipped as a binary, so every user is hitting this unfortunate coincidence. Of course, since we're shipping the library, I can suggest we modify the source to make it more friendly to rr patching. That said, I wasn't quite sure what would be best to suggest here. At first I thought, I could simply insert a bunch of nops after. Unfortunately, that doesn't actually do anything, because the branch could still be interfering into the middle of the nop sled. Then I thought maybe a singular large `nop` would be sufficient, but of course, there could be something that intentionally conditionally skips the rdtsc. In this PR, I propose that we make the following sequence known-safe: ``` nopl 0(%ax, %ax, 1) # single instruction, 5-byte nop rdtsc ``` This currently wouldn't quite work, because an interfering jump to the `rdtsc` would simply hit the trailing nop padding, ignoring the rdtsc. However, if we slightly tweak the patch to instead use: ``` 1: jmp %hook [usual nop padding here] jmp 1b 2: ``` Then everything works out well. Our return address is past the entire patch region, so we return to the correct place and it doesn't matter whether we jump to the nop or to the instruction itself. Of course, and actual interfering branch over the nop is unlikely (though I supposed not impossible depending on what comes before), since interfering branches are supported, this nicely takes care of the spurious branch problem above (assuming the extra 5-byte nop is inserted).
Several users have reported masssive slowdowns to rr record/replay in packages that have the tracy client loaded. The issue turns out to be an unlucky false positive in an rr heuristic that determines the patchability of `rdtsc`. In rr master, it is possible to guarantee patchability by using a specific `nopl; rdtsc` sequence. Add a patch to tracy to do just that. See also rr-debugger/rr#3580 JuliaLang/julia#50975
Several users have reported masssive slowdowns to rr record/replay in packages that have the tracy client loaded. The issue turns out to be an unlucky false positive in an rr heuristic that determines the patchability of `rdtsc`. In rr master, it is possible to guarantee patchability by using a specific `nopl; rdtsc` sequence. Add a patch to tracy to do just that. See also rr-debugger/rr#3580 JuliaLang/julia#50975
Several users have reported masssive slowdowns to rr record/replay in packages that have the tracy client loaded. The issue turns out to be an unlucky false positive in an rr heuristic that determines the patchability of `rdtsc`. In rr master, it is possible to guarantee patchability by using a specific `nopl; rdtsc` sequence. Add a patch to tracy to do just that. See also rr-debugger/rr#3580 JuliaLang/julia#50975
Several users have reported masssive slowdowns to rr record/replay in packages that have the tracy client loaded. The issue turns out to be an unlucky false positive in an rr heuristic that determines the patchability of `rdtsc`. In rr master, it is possible to guarantee patchability by using a specific `nopl; rdtsc` sequence. Add a patch to tracy to do just that. See also rr-debugger/rr#3580 JuliaLang/julia#50975
Looks like the new test if failing on CI after my rebase onto the merged version of the other branch. Not sure what I did there, but I guess it's possible I typo'd something. Unfortunately, I'm on my way out of the office, but I'll take a look tomorrow. |
Never mind, it's a simple bug where one of the conditions snuck to the wrong level of nesting - I'll just push a fix. |
Several users have reported masssive slowdowns to rr record/replay in packages that have the tracy client loaded. The issue turns out to be an unlucky false positive in an rr heuristic that determines the patchability of `rdtsc`. In rr master, it is possible to guarantee patchability by using a specific `nopl; rdtsc` sequence. Add a patch to tracy to do just that. See also rr-debugger/rr#3580 JuliaLang/julia#50975
Several users have reported masssive slowdowns to rr record/replay in packages that have the tracy client loaded. The issue turns out to be an unlucky false positive in an rr heuristic that determines the patchability of `rdtsc`. In rr master, it is possible to guarantee patchability by using a specific `nopl; rdtsc` sequence. Add a patch to tracy to do just that. See also rr-debugger/rr#3580 JuliaLang/julia#50975
We (Julia) ship both support for using tracy to trace julia applications, as well as using `rr` (https://github.com/rr-debugger/rr) for record-replay debugging. After our most recent rebuild of tracy, users have been reporting signfificant performance slowdowns when `rr` recording a session that happens to also load the tracy library (even if tracing is not enabled). Upon further examination, the recompile happened to trigger a protective heuristic that disabled rr's patching of tracy's use of `rdtsc` because an earlier part of the same function happened to look like a conditional branch into the patch region. See rr-debugger/rr#3580 for details. To avoid this issue occurring again in future rebuilds of tracy, adjust tracy's `rdtsc` sequence to be `nopl; rdtsc`, which (as of of the linked PR) is a sequence that is guaranteed to bypass this heuristic and not incur the additional overhead when run under rr.
We (Julia) ship both support for using tracy to trace julia applications, as well as using `rr` (https://github.com/rr-debugger/rr) for record-replay debugging. After our most recent rebuild of tracy, users have been reporting signfificant performance slowdowns when `rr` recording a session that happens to also load the tracy library (even if tracing is not enabled). Upon further examination, the recompile happened to trigger a protective heuristic that disabled rr's patching of tracy's use of `rdtsc` because an earlier part of the same function happened to look like a conditional branch into the patch region. See rr-debugger/rr#3580 for details. To avoid this issue occurring again in future rebuilds of tracy, adjust tracy's `rdtsc` sequence to be `nopl; rdtsc`, which (as of of the linked PR) is a sequence that is guaranteed to bypass this heuristic and not incur the additional overhead when run under rr. A new compile-time flag `TRACY_NO_EXTRA_NOPSLEDS` is available to disable this expanded `rdstsc` sequence if code size is a concern.
We (Julia) ship both support for using tracy to trace julia applications, as well as using `rr` (https://github.com/rr-debugger/rr) for record-replay debugging. After our most recent rebuild of tracy, users have been reporting signfificant performance slowdowns when `rr` recording a session that happens to also load the tracy library (even if tracing is not enabled). Upon further examination, the recompile happened to trigger a protective heuristic that disabled rr's patching of tracy's use of `rdtsc` because an earlier part of the same function happened to look like a conditional branch into the patch region. See rr-debugger/rr#3580 for details. To avoid this issue occurring again in future rebuilds of tracy, adjust tracy's `rdtsc` sequence to be `nopl; rdtsc`, which (as of of the linked PR) is a sequence that is guaranteed to bypass this heuristic and not incur the additional overhead when run under rr. This functionality is kept behind a compile-time flag `TRACY_PATCHABLE_NOPSLEDS` in order to avoid polluting the instruction cache unnecessarily.
I was looking some disassembly that looked like this:
We do have a syscallbuf template for
shl $0x20,%rdx
. Irritatingly, however, the7c 24
at the top trigger's rr's interfering branch heuristic by false-positive coincidence. Even more unfortunately, this is a calibration loop for a tracing library, so it does actually just callrdtsc
in a loop a bunch. Additionally this is shipped as a binary, so every user is hitting this unfortunate coincidence. Of course, since we're shipping the library, I can suggest we modify the source to make it more friendly to rr patching. That said, I wasn't quite sure what would be best to suggest here.At first I thought, I could simply insert a bunch of nops after. Unfortunately, that doesn't actually do anything, because the branch could still be interfering into the middle of the nop sled.
Then I thought maybe a singular large
nop
would be sufficient, but of course, there could be something that intentionally conditionally skips the rdtsc.In this PR, I propose that we make the following sequence known-safe:
This currently wouldn't quite work, because an interfering jump to the
rdtsc
would simply hit the trailing nop padding, ignoring the rdtsc. However, if we slightly tweak the patch to instead use:Then everything works out well. Our return address is past the entire patch region, so we return to the correct place and it doesn't matter whether we jump to the nop or to the instruction itself. Of course, and actual interfering branch over the nop is unlikely (though I supposed not impossible depending on what comes before), since interfering branches are supported, this nicely takes care of the spurious branch problem above (assuming the extra 5-byte nop is inserted).