Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to C implementation of EventPipe #46079

Closed
5 of 6 tasks
josalem opened this issue Dec 15, 2020 · 19 comments · Fixed by #47665
Closed
5 of 6 tasks

Switch to C implementation of EventPipe #46079

josalem opened this issue Dec 15, 2020 · 19 comments · Fixed by #47665
Assignees
Labels
area-Tracing-coreclr area-Tracing-mono enhancement Product code improvement that does NOT require public API changes/additions tracking This issue is tracking the completion of other related issues.
Milestone

Comments

@josalem
Copy link
Contributor

josalem commented Dec 15, 2020

We should have a CI pipeline that tests the C version of EventPipe until we have permanently turned on the feature flag. This will help to make sure we don't have any regressions in the meantime.

This issue will track progress towards switching over to the C implementation of EventPipe. Currently, it is behind a feature flag. Before we transition, we should collect the following information:

  • code size comparison
    • See @am11's comment below for some rudimentary numbers
  • throughput comparisons
    • We should document throughputs for both the C and C++ implementations before removing the C++ implementation.
  • regressions analysis with external tools
    • We have tests in the runtime repo that test the IPC protocols and EventPipe directly, but we don't have tests that cover the end-to-end scenario when it comes to dotnet-trace, etw, or other external tools (We do in dotnet/diagnostics though). We should do some local testing to make sure we aren't breaking any scenarios (I doubt we are).
  • Regressions from C++

CC - @tommcdon @lateralusX

@josalem josalem added this to the 6.0.0 milestone Dec 15, 2020
@josalem josalem self-assigned this Dec 15, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 15, 2020
@jkotas
Copy link
Member

jkotas commented Dec 15, 2020

What is the remaining work that prevents us from switching to the C flavor of eventpipe by default?

@josalem
Copy link
Contributor Author

josalem commented Dec 15, 2020

As I itemize this work I'm not seeing anything that prevents us from switching by default.

  • Code size comparisons
  • throughput comparisons
  • regression analysis with external tools

@lateralusX is there anything else you can think of? If the prevailing opinion is to "switch now and stabilize in master" I don't have any immediate objections. We could always revert the feature flag PR if CI starts failing.

CC @sywhang @tommcdon @noahfalk

@am11
Copy link
Member

am11 commented Dec 15, 2020

FWIW, high-level figures suggest overall decrease in library physical size after switching to C implementation. On Linux x64, libmscordbi.so size increased by 4 KiB, while libcoreclr.so decreased by 11.3 KiB (as of e29b839).

lib after size - before size diff
libcoreclr.so 7037856 - 7049408 -11552
libdbgshim.so 721808 - 721776 +32
libmscordbi.so 1820032 - 1815904 +4128

@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Dec 15, 2020
@josalem
Copy link
Contributor Author

josalem commented Dec 15, 2020

Chatted offline with Tom about this. Let's instead aim for doing the switch and stabilizing in master as Jan mentioned. I'll run some local tests for the bullet points above and document them here. I'll repurpose this issue to track switching over to the C version instead of adding a CI pipeline.

@josalem josalem added enhancement Product code improvement that does NOT require public API changes/additions and removed test-enhancement Improvements of test source code labels Dec 15, 2020
@josalem josalem changed the title Add a CI pipeline with the feature flag for the C flavor of EventPipe Switch to C implementation of EventPipe Dec 15, 2020
@josalem
Copy link
Contributor Author

josalem commented Dec 15, 2020

Just did a size comparison for macos-x64 at commit 7ae298e

lib after size - before size diff
libcoreclr.dylib 7033296 - 7042176 -8880
libdbgshim.dylib 657360 - 657040 +320
libmscordbi.dylib 1966096 - 1965744 +352

Seems to be meeting expectations that it would reduce code size. My assumption is that the changes to dbgshim and the mscordbi are most likely noise.

@josalem
Copy link
Contributor Author

josalem commented Dec 16, 2020

One hiccup already.

I started my attempts at collecting end-to-end throughput measurements, but I'm getting an AV in a Release macos-x64 build with the C implementation that doesn't happen with the C++ implementation.

Process 69371 resuming
Process 69371 stopped
* thread #11, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x00000001041bb967 libcoreclr.dylib`ep_buffer_manager_write_event(_EventPipeBufferManager*, Thread*, _EventPipeSession*, _EventPipeEvent*, _EventPipeEventPayload*, unsigned char const*, unsigned char const*, Thread*, _EventPipeStackContents*) + 151
libcoreclr.dylib`ep_buffer_manager_write_event:
->  0x1041bb967 <+151>: movq   0x10(%rbx), %rdi
    0x1041bb96b <+155>: testq  %rdi, %rdi
    0x1041bb96e <+158>: jne    0x1041bba0b               ; <+315>
    0x1041bb974 <+164>: jmp    0x1041bba4e               ; <+382>
Target 0: (corescaletest) stopped.
(lldb) bt
* thread #11, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x00000001041bb967 libcoreclr.dylib`ep_buffer_manager_write_event(_EventPipeBufferManager*, Thread*, _EventPipeSession*, _EventPipeEvent*, _EventPipeEventPayload*, unsigned char const*, unsigned char const*, Thread*, _EventPipeStackContents*) + 151
    frame #1: 0x00000001041c3d34 libcoreclr.dylib`ep_session_write_event(_EventPipeSession*, Thread*, _EventPipeEvent*, _EventPipeEventPayload*, unsigned char const*, unsigned char const*, Thread*, _EventPipeStackContents*) + 292
    frame #2: 0x00000001041c6b8a libcoreclr.dylib`write_event_2(Thread*, _EventPipeEvent*, _EventPipeEventPayload*, unsigned char const*, unsigned char const*, Thread*, _EventPipeStackContents*) + 234
    frame #3: 0x00000001041c0fff libcoreclr.dylib`ep_write_event_2(_EventPipeEvent*, _EventData*, unsigned int, unsigned char const*, unsigned char const*) + 191
    frame #4: 0x0000000103f3191d libcoreclr.dylib`EventPipeInternal::WriteEventData(long, _EventData*, unsigned int, _GUID const*, _GUID const*) + 61
    frame #5: 0x000000011abdea9d
    frame #6: 0x000000011abde62d
    frame #7: 0x000000011abde1da
    frame #8: 0x000000011abddc85
    frame #9: 0x000000011abddc44
    frame #10: 0x000000011a732912
    frame #11: 0x000000011a73dab1
    frame #12: 0x000000011a732a2e
    frame #13: 0x0000000104097019 libcoreclr.dylib`CallDescrWorkerInternal + 124
    frame #14: 0x0000000103efe00f libcoreclr.dylib`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 1519
    frame #15: 0x0000000103f0e984 libcoreclr.dylib`ThreadNative::KickOffThread_Worker(void*) + 404
    frame #16: 0x0000000103ec859b libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 315
    frame #17: 0x0000000103ec8b50 libcoreclr.dylib`ManagedThreadBase::KickOff(void (*)(void*), void*) + 32
    frame #18: 0x0000000103f0eaff libcoreclr.dylib`ThreadNative::KickOffThread(void*) + 191
    frame #19: 0x0000000103dbc14a libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(void*) + 426
    frame #20: 0x00007fff719bf109 libsystem_pthread.dylib`_pthread_start + 148
    frame #21: 0x00007fff719bab8b libsystem_pthread.dylib`thread_start + 15

I'll try switching to a checked build and see if I can't pin down what's happening.

For reference, the test I'm running writes events across N threads (N == number of cores on machine, 4 in my case) without pausing in between. It's meant to saturate the buffer system and drop events. I have levers for reducing that stream to more realistic values and simulate bursts and periodic behavior.

CC @lateralusX

@josalem
Copy link
Contributor Author

josalem commented Dec 16, 2020

The above AV doesn't happen under a checked build of the C impl, but a checked build triggers an assert during process shutdown if the app has a tracing session open.

Assert failure(PID 21410 [0x000053a2], Thread: 23643989 [0x168c755]): thread->session_state [i] == 0
    File: /Users/josalem/git/alts/runtime/src/native/eventpipe/ep-thread.c Line: 64
    Image: /Users/josalem/git/EventPipeLab/scalability/corescaletest/c-checked/corescaletest

This is a valid scenario that works under the C++ implementation.

@sywhang
Copy link
Contributor

sywhang commented Dec 16, 2020

@josalem can we file separate issues for each problem we encounter? I’m guessing this is only the beginning of issues we’ll start discovering.

@lateralusX
Copy link
Member

@josalem

Possible to get the app you are running hitting the AV so I can try to get local repro on Windows? Is the AV also happening during shutdown of the app or was that just the assert?

Would be good to have a little more assembler of ep_buffer_manager_write_event from the crash in order to try to pin point where in source that could be, but one thing comes to mind is a write that is targeting a session currently in the process of being disabled. Did you have any other threads in that core dump that seemed to be in the process of disabling a session (potentially the one targeted by the write)?

I will try to run some multithreaded scenarios on Windows to see if something pops up.

@lateralusX
Copy link
Member

Also there are a couple of changes in C++ library not yet ported into C library, not sure it will affect this, but just pointing that out as well. Will try to get to that by end of week (porting the missing PR's over to C library).

@marek-safar marek-safar added the tracking This issue is tracking the completion of other related issues. label Dec 16, 2020
@lateralusX
Copy link
Member

Have captured something under the debugger that is probably related. Seems to be problems with ref count on EventPipeThread meaning that we apparently could end up with a released thread during session disable (when we delete thread session state). Thread ref counting and EventPipeThreadHolder ties into thread local storage and relies on implementation in the shim. Now I have something to investigate future around.

@lateralusX
Copy link
Member

And if I get passed disabling session, next time I start a new session and writing an event I get a callstack very similar to the above, hitting the same issue, thread in tls has been incorrectly freed at some previous point in time. If I don't disable the session, all 8 threads continue to write events into session as expected.

@lateralusX
Copy link
Member

lateralusX commented Dec 16, 2020

Think I pinpointed the issue, it is a race between TLS Thread holder doing free on destructor when a thread terminates (in this case the streaming thread) and a call disabling a session, iterating all threads in thread list. The disable thread will get a copy of all threads (done while holding lock and addref each thread) in list and then iterate them, but since thread calling release from destructor does the atomic decrement without holding the lock and then takes lock and remove itself from list opens up for a race where a thread on the list is about to be destructed, thread calling release hit refcount to 0, but disable thread takes copies of all threads and bumps refcount back to 1 just before thread calling release takes lock, removes thread from list and free thread. This results in disable thread holding a copy of a freed thread, that in turn will cause random side effects.

Looks like the same race exists in the C++ implementation. I will see if I can implement a fix in the C library to begin with and based on outcome we can probably do the same fix in C++.

@lateralusX
Copy link
Member

lateralusX commented Dec 16, 2020

So we probably need to split the concept of having a reference to an EventPipeThread object, handling lifetime of object instances and the add/remove from the global thread list in EventPipe. Since a thread can die at any time, triggering the TLS destructor it already means we can have EventPipeThread objects out living the physical thread. So instead of removing the thread from the list as part of hitting a 0 ref count as part of EventPipeThread::Release, we should move that logic into the lifetime of the thread (tracked by our EventPipeThreadHolder thread_local), meaning that we first remove from thread from the list in thread_local gCurrentEventPipeThreadHolder destructor and then decrement ref count, that will prevent us from hitting ref count 0 just before another thread takes copy of list ending up with a thread about to be deleted from the list and freed. I will experiment with that fix.

@josalem
Copy link
Contributor Author

josalem commented Dec 16, 2020

can we file separate issues for each problem we encounter?

Yes. Let's use this issue as the higher-level tracking issue and break individual issues out as the come up.

Possible to get the app you are running hitting the AV so I can try to get local repro on Windows? Is the AV also happening during shutdown of the app or was that just the assert?

Yes. Let me open separate issues for each thing and add details there. I'll copy your notes over to the individual issues. The assert and the AV are two different issues since the AV is happening almost immediately and the assert is happening during app shutdown after sending events for 60s. I'll package up the stress suite I've been using for reuse.

@lateralusX
Copy link
Member

Added findings into the two issues, working on a fix for C library, but the race is also in C++ code, so once we have a verified fix we could discuss if it should be ported or if we ignore since we are moving over to the C library anyways.

@lateralusX
Copy link
Member

lateralusX commented Dec 16, 2020

Fixes seems to have positive impact on Windows, won't be able to repro issues (that I could do locally before changes) running 8 threads writing events into a dotnet-trace session using EventPipe C library. Will collect changes into PR tomorrow morning.

@lateralusX
Copy link
Member

PR addressing both identified AV and assert, #46193.

This was referenced Jan 20, 2021
@josalem
Copy link
Contributor Author

josalem commented Jan 30, 2021

I did some local testing with the diagnostics tools and found them to be in working order with the C implementation.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 30, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 1, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Mar 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Tracing-coreclr area-Tracing-mono enhancement Product code improvement that does NOT require public API changes/additions tracking This issue is tracking the completion of other related issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants