-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to C implementation of EventPipe #46079
Comments
What is the remaining work that prevents us from switching to the C flavor of eventpipe by default? |
As I itemize this work I'm not seeing anything that prevents us from switching by default.
@lateralusX is there anything else you can think of? If the prevailing opinion is to "switch now and stabilize in master" I don't have any immediate objections. We could always revert the feature flag PR if CI starts failing. |
FWIW, high-level figures suggest overall decrease in library physical size after switching to C implementation. On Linux x64, libmscordbi.so size increased by 4 KiB, while libcoreclr.so decreased by 11.3 KiB (as of e29b839).
|
Chatted offline with Tom about this. Let's instead aim for doing the switch and stabilizing in master as Jan mentioned. I'll run some local tests for the bullet points above and document them here. I'll repurpose this issue to track switching over to the C version instead of adding a CI pipeline. |
Just did a size comparison for macos-x64 at commit 7ae298e
Seems to be meeting expectations that it would reduce code size. My assumption is that the changes to dbgshim and the mscordbi are most likely noise. |
One hiccup already. I started my attempts at collecting end-to-end throughput measurements, but I'm getting an AV in a Release macos-x64 build with the C implementation that doesn't happen with the C++ implementation.
I'll try switching to a checked build and see if I can't pin down what's happening. For reference, the test I'm running writes events across N threads (N == number of cores on machine, 4 in my case) without pausing in between. It's meant to saturate the buffer system and drop events. I have levers for reducing that stream to more realistic values and simulate bursts and periodic behavior. CC @lateralusX |
The above AV doesn't happen under a checked build of the C impl, but a checked build triggers an assert during process shutdown if the app has a tracing session open.
This is a valid scenario that works under the C++ implementation. |
@josalem can we file separate issues for each problem we encounter? I’m guessing this is only the beginning of issues we’ll start discovering. |
Possible to get the app you are running hitting the AV so I can try to get local repro on Windows? Is the AV also happening during shutdown of the app or was that just the assert? Would be good to have a little more assembler of ep_buffer_manager_write_event from the crash in order to try to pin point where in source that could be, but one thing comes to mind is a write that is targeting a session currently in the process of being disabled. Did you have any other threads in that core dump that seemed to be in the process of disabling a session (potentially the one targeted by the write)? I will try to run some multithreaded scenarios on Windows to see if something pops up. |
Also there are a couple of changes in C++ library not yet ported into C library, not sure it will affect this, but just pointing that out as well. Will try to get to that by end of week (porting the missing PR's over to C library). |
Have captured something under the debugger that is probably related. Seems to be problems with ref count on EventPipeThread meaning that we apparently could end up with a released thread during session disable (when we delete thread session state). Thread ref counting and EventPipeThreadHolder ties into thread local storage and relies on implementation in the shim. Now I have something to investigate future around. |
And if I get passed disabling session, next time I start a new session and writing an event I get a callstack very similar to the above, hitting the same issue, thread in tls has been incorrectly freed at some previous point in time. If I don't disable the session, all 8 threads continue to write events into session as expected. |
Think I pinpointed the issue, it is a race between TLS Thread holder doing free on destructor when a thread terminates (in this case the streaming thread) and a call disabling a session, iterating all threads in thread list. The disable thread will get a copy of all threads (done while holding lock and addref each thread) in list and then iterate them, but since thread calling release from destructor does the atomic decrement without holding the lock and then takes lock and remove itself from list opens up for a race where a thread on the list is about to be destructed, thread calling release hit refcount to 0, but disable thread takes copies of all threads and bumps refcount back to 1 just before thread calling release takes lock, removes thread from list and free thread. This results in disable thread holding a copy of a freed thread, that in turn will cause random side effects. Looks like the same race exists in the C++ implementation. I will see if I can implement a fix in the C library to begin with and based on outcome we can probably do the same fix in C++. |
So we probably need to split the concept of having a reference to an EventPipeThread object, handling lifetime of object instances and the add/remove from the global thread list in EventPipe. Since a thread can die at any time, triggering the TLS destructor it already means we can have EventPipeThread objects out living the physical thread. So instead of removing the thread from the list as part of hitting a 0 ref count as part of EventPipeThread::Release, we should move that logic into the lifetime of the thread (tracked by our EventPipeThreadHolder thread_local), meaning that we first remove from thread from the list in thread_local gCurrentEventPipeThreadHolder destructor and then decrement ref count, that will prevent us from hitting ref count 0 just before another thread takes copy of list ending up with a thread about to be deleted from the list and freed. I will experiment with that fix. |
Yes. Let's use this issue as the higher-level tracking issue and break individual issues out as the come up.
Yes. Let me open separate issues for each thing and add details there. I'll copy your notes over to the individual issues. The assert and the AV are two different issues since the AV is happening almost immediately and the assert is happening during app shutdown after sending events for 60s. I'll package up the stress suite I've been using for reuse. |
Added findings into the two issues, working on a fix for C library, but the race is also in C++ code, so once we have a verified fix we could discuss if it should be ported or if we ignore since we are moving over to the C library anyways. |
Fixes seems to have positive impact on Windows, won't be able to repro issues (that I could do locally before changes) running 8 threads writing events into a dotnet-trace session using EventPipe C library. Will collect changes into PR tomorrow morning. |
PR addressing both identified AV and assert, #46193. |
I did some local testing with the diagnostics tools and found them to be in working order with the C implementation. |
We should have a CI pipeline that tests the C version of EventPipe until we have permanently turned on the feature flag. This will help to make sure we don't have any regressions in the meantime.This issue will track progress towards switching over to the C implementation of EventPipe. Currently, it is behind a feature flag. Before we transition, we should collect the following information:
ep_buffer_manager_write_event
(AV in Release build of C implementation of EventPipe #46158)CC - @tommcdon @lateralusX
The text was updated successfully, but these errors were encountered: