Switch to C implementation of EventPipe #46079

josalem · 2020-12-15T17:30:01Z

~~We should have a CI pipeline that tests the C version of EventPipe until we have permanently turned on the feature flag. This will help to make sure we don't have any regressions in the meantime.~~

This issue will track progress towards switching over to the C implementation of EventPipe. Currently, it is behind a feature flag. Before we transition, we should collect the following information:

code size comparison
- See @am11's comment below for some rudimentary numbers
throughput comparisons
- We should document throughputs for both the C and C++ implementations before removing the C++ implementation.
regressions analysis with external tools
- We have tests in the runtime repo that test the IPC protocols and EventPipe directly, but we don't have tests that cover the end-to-end scenario when it comes to dotnet-trace, etw, or other external tools (We do in dotnet/diagnostics though). We should do some local testing to make sure we aren't breaking any scenarios (I doubt we are).
Regressions from C++
- Release builds of C impl AV inside ep_buffer_manager_write_event (AV in Release build of C implementation of EventPipe #46158)
- Checked builds of C assert if app has is being traced during shutdown, e.g., tracing and app to completion. (Assert on process shutdown while tracing C implementation of EventPipe #46159)
  - Perhaps this assert comes from the Mono requirement to be able to clean up an restart execution of the runtime vs CoreCLR's expectation that shutdown will lead to the OS cleaning things up?
  - Not tracked by this issue: we should add a test that tests this scenario as it is expected to work.

CC - @tommcdon @lateralusX

The text was updated successfully, but these errors were encountered:

jkotas · 2020-12-15T18:11:10Z

What is the remaining work that prevents us from switching to the C flavor of eventpipe by default?

josalem · 2020-12-15T18:16:53Z

As I itemize this work I'm not seeing anything that prevents us from switching by default.

Code size comparisons
throughput comparisons
regression analysis with external tools

@lateralusX is there anything else you can think of? If the prevailing opinion is to "switch now and stabilize in master" I don't have any immediate objections. We could always revert the feature flag PR if CI starts failing.

CC @sywhang @tommcdon @noahfalk

am11 · 2020-12-15T19:12:55Z

FWIW, high-level figures suggest overall decrease in library physical size after switching to C implementation. On Linux x64, libmscordbi.so size increased by 4 KiB, while libcoreclr.so decreased by 11.3 KiB (as of e29b839).

lib	after size - before size	diff
libcoreclr.so	7037856 - `7049408`	-11552
libdbgshim.so	721808 - 721776	+32
libmscordbi.so	1820032 - 1815904	+4128

josalem · 2020-12-15T19:21:59Z

Chatted offline with Tom about this. Let's instead aim for doing the switch and stabilizing in master as Jan mentioned. I'll run some local tests for the bullet points above and document them here. I'll repurpose this issue to track switching over to the C version instead of adding a CI pipeline.

josalem · 2020-12-15T23:22:18Z

Just did a size comparison for macos-x64 at commit 7ae298e

lib	after size - before size	diff
libcoreclr.dylib	7033296 - `7042176`	-8880
libdbgshim.dylib	657360 - 657040	+320
libmscordbi.dylib	1966096 - `1965744`	+352

Seems to be meeting expectations that it would reduce code size. My assumption is that the changes to dbgshim and the mscordbi are most likely noise.

josalem · 2020-12-16T00:01:32Z

One hiccup already.

I started my attempts at collecting end-to-end throughput measurements, but I'm getting an AV in a Release macos-x64 build with the C implementation that doesn't happen with the C++ implementation.

Process 69371 resuming
Process 69371 stopped
* thread #11, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x00000001041bb967 libcoreclr.dylib`ep_buffer_manager_write_event(_EventPipeBufferManager*, Thread*, _EventPipeSession*, _EventPipeEvent*, _EventPipeEventPayload*, unsigned char const*, unsigned char const*, Thread*, _EventPipeStackContents*) + 151
libcoreclr.dylib`ep_buffer_manager_write_event:
->  0x1041bb967 <+151>: movq   0x10(%rbx), %rdi
    0x1041bb96b <+155>: testq  %rdi, %rdi
    0x1041bb96e <+158>: jne    0x1041bba0b               ; <+315>
    0x1041bb974 <+164>: jmp    0x1041bba4e               ; <+382>
Target 0: (corescaletest) stopped.
(lldb) bt
* thread #11, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x00000001041bb967 libcoreclr.dylib`ep_buffer_manager_write_event(_EventPipeBufferManager*, Thread*, _EventPipeSession*, _EventPipeEvent*, _EventPipeEventPayload*, unsigned char const*, unsigned char const*, Thread*, _EventPipeStackContents*) + 151
    frame #1: 0x00000001041c3d34 libcoreclr.dylib`ep_session_write_event(_EventPipeSession*, Thread*, _EventPipeEvent*, _EventPipeEventPayload*, unsigned char const*, unsigned char const*, Thread*, _EventPipeStackContents*) + 292
    frame #2: 0x00000001041c6b8a libcoreclr.dylib`write_event_2(Thread*, _EventPipeEvent*, _EventPipeEventPayload*, unsigned char const*, unsigned char const*, Thread*, _EventPipeStackContents*) + 234
    frame #3: 0x00000001041c0fff libcoreclr.dylib`ep_write_event_2(_EventPipeEvent*, _EventData*, unsigned int, unsigned char const*, unsigned char const*) + 191
    frame #4: 0x0000000103f3191d libcoreclr.dylib`EventPipeInternal::WriteEventData(long, _EventData*, unsigned int, _GUID const*, _GUID const*) + 61
    frame #5: 0x000000011abdea9d
    frame #6: 0x000000011abde62d
    frame #7: 0x000000011abde1da
    frame #8: 0x000000011abddc85
    frame #9: 0x000000011abddc44
    frame #10: 0x000000011a732912
    frame #11: 0x000000011a73dab1
    frame #12: 0x000000011a732a2e
    frame #13: 0x0000000104097019 libcoreclr.dylib`CallDescrWorkerInternal + 124
    frame #14: 0x0000000103efe00f libcoreclr.dylib`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 1519
    frame #15: 0x0000000103f0e984 libcoreclr.dylib`ThreadNative::KickOffThread_Worker(void*) + 404
    frame #16: 0x0000000103ec859b libcoreclr.dylib`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 315
    frame #17: 0x0000000103ec8b50 libcoreclr.dylib`ManagedThreadBase::KickOff(void (*)(void*), void*) + 32
    frame #18: 0x0000000103f0eaff libcoreclr.dylib`ThreadNative::KickOffThread(void*) + 191
    frame #19: 0x0000000103dbc14a libcoreclr.dylib`CorUnix::CPalThread::ThreadEntry(void*) + 426
    frame #20: 0x00007fff719bf109 libsystem_pthread.dylib`_pthread_start + 148
    frame #21: 0x00007fff719bab8b libsystem_pthread.dylib`thread_start + 15

I'll try switching to a checked build and see if I can't pin down what's happening.

For reference, the test I'm running writes events across N threads (N == number of cores on machine, 4 in my case) without pausing in between. It's meant to saturate the buffer system and drop events. I have levers for reducing that stream to more realistic values and simulate bursts and periodic behavior.

CC @lateralusX

josalem · 2020-12-16T00:37:42Z

The above AV doesn't happen under a checked build of the C impl, but a checked build triggers an assert during process shutdown if the app has a tracing session open.

Assert failure(PID 21410 [0x000053a2], Thread: 23643989 [0x168c755]): thread->session_state [i] == 0
    File: /Users/josalem/git/alts/runtime/src/native/eventpipe/ep-thread.c Line: 64
    Image: /Users/josalem/git/EventPipeLab/scalability/corescaletest/c-checked/corescaletest

This is a valid scenario that works under the C++ implementation.

sywhang · 2020-12-16T02:29:58Z

@josalem can we file separate issues for each problem we encounter? I’m guessing this is only the beginning of issues we’ll start discovering.

lateralusX · 2020-12-16T10:42:10Z

@josalem

Possible to get the app you are running hitting the AV so I can try to get local repro on Windows? Is the AV also happening during shutdown of the app or was that just the assert?

Would be good to have a little more assembler of ep_buffer_manager_write_event from the crash in order to try to pin point where in source that could be, but one thing comes to mind is a write that is targeting a session currently in the process of being disabled. Did you have any other threads in that core dump that seemed to be in the process of disabling a session (potentially the one targeted by the write)?

I will try to run some multithreaded scenarios on Windows to see if something pops up.

lateralusX · 2020-12-16T10:48:11Z

Also there are a couple of changes in C++ library not yet ported into C library, not sure it will affect this, but just pointing that out as well. Will try to get to that by end of week (porting the missing PR's over to C library).

lateralusX · 2020-12-16T13:45:58Z

Have captured something under the debugger that is probably related. Seems to be problems with ref count on EventPipeThread meaning that we apparently could end up with a released thread during session disable (when we delete thread session state). Thread ref counting and EventPipeThreadHolder ties into thread local storage and relies on implementation in the shim. Now I have something to investigate future around.

lateralusX · 2020-12-16T13:51:26Z

And if I get passed disabling session, next time I start a new session and writing an event I get a callstack very similar to the above, hitting the same issue, thread in tls has been incorrectly freed at some previous point in time. If I don't disable the session, all 8 threads continue to write events into session as expected.

lateralusX · 2020-12-16T14:41:47Z

Think I pinpointed the issue, it is a race between TLS Thread holder doing free on destructor when a thread terminates (in this case the streaming thread) and a call disabling a session, iterating all threads in thread list. The disable thread will get a copy of all threads (done while holding lock and addref each thread) in list and then iterate them, but since thread calling release from destructor does the atomic decrement without holding the lock and then takes lock and remove itself from list opens up for a race where a thread on the list is about to be destructed, thread calling release hit refcount to 0, but disable thread takes copies of all threads and bumps refcount back to 1 just before thread calling release takes lock, removes thread from list and free thread. This results in disable thread holding a copy of a freed thread, that in turn will cause random side effects.

Looks like the same race exists in the C++ implementation. I will see if I can implement a fix in the C library to begin with and based on outcome we can probably do the same fix in C++.

lateralusX · 2020-12-16T15:24:04Z

So we probably need to split the concept of having a reference to an EventPipeThread object, handling lifetime of object instances and the add/remove from the global thread list in EventPipe. Since a thread can die at any time, triggering the TLS destructor it already means we can have EventPipeThread objects out living the physical thread. So instead of removing the thread from the list as part of hitting a 0 ref count as part of EventPipeThread::Release, we should move that logic into the lifetime of the thread (tracked by our EventPipeThreadHolder thread_local), meaning that we first remove from thread from the list in thread_local gCurrentEventPipeThreadHolder destructor and then decrement ref count, that will prevent us from hitting ref count 0 just before another thread takes copy of list ending up with a thread about to be deleted from the list and freed. I will experiment with that fix.

josalem · 2020-12-16T18:50:38Z

can we file separate issues for each problem we encounter?

Yes. Let's use this issue as the higher-level tracking issue and break individual issues out as the come up.

Possible to get the app you are running hitting the AV so I can try to get local repro on Windows? Is the AV also happening during shutdown of the app or was that just the assert?

Yes. Let me open separate issues for each thing and add details there. I'll copy your notes over to the individual issues. The assert and the AV are two different issues since the AV is happening almost immediately and the assert is happening during app shutdown after sending events for 60s. I'll package up the stress suite I've been using for reuse.

lateralusX · 2020-12-16T19:39:40Z

Added findings into the two issues, working on a fix for C library, but the race is also in C++ code, so once we have a verified fix we could discuss if it should be ported or if we ignore since we are moving over to the C library anyways.

lateralusX · 2020-12-16T21:59:26Z

Fixes seems to have positive impact on Windows, won't be able to repro issues (that I could do locally before changes) running 8 threads writing events into a dotnet-trace session using EventPipe C library. Will collect changes into PR tomorrow morning.

lateralusX · 2020-12-17T16:35:00Z

PR addressing both identified AV and assert, #46193.

josalem · 2021-01-30T00:12:53Z

I did some local testing with the diagnostics tools and found them to be in working order with the C implementation.

josalem added test-enhancement Improvements of test source code area-Tracing-coreclr area-Tracing-mono labels Dec 15, 2020

josalem added this to the 6.0.0 milestone Dec 15, 2020

josalem self-assigned this Dec 15, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 15, 2020

tommcdon removed the untriaged New issue has not been triaged by the area owner label Dec 15, 2020

josalem added enhancement Product code improvement that does NOT require public API changes/additions and removed test-enhancement Improvements of test source code labels Dec 15, 2020

josalem changed the title ~~Add a CI pipeline with the feature flag for the C flavor of EventPipe~~ Switch to C implementation of EventPipe Dec 15, 2020

marek-safar added the tracking This issue is tracking the completion of other related issues. label Dec 16, 2020

This was referenced Dec 16, 2020

AV in Release build of C implementation of EventPipe #46158

Closed

Assert on process shutdown while tracing C implementation of EventPipe #46159

Closed

This was referenced Jan 20, 2021

Improve Eventing Fundamentals #45518

Closed

bandaid for #44072 #47285

Merged

josalem mentioned this issue Jan 30, 2021

Use C EventPipe implementation by default #47665

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 30, 2021

josalem closed this as completed in #47665 Feb 1, 2021

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 1, 2021

ghost locked as resolved and limited conversation to collaborators Mar 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to C implementation of EventPipe #46079

Switch to C implementation of EventPipe #46079

josalem commented Dec 15, 2020 •

edited

Loading

jkotas commented Dec 15, 2020

josalem commented Dec 15, 2020 •

edited

Loading

am11 commented Dec 15, 2020

josalem commented Dec 15, 2020

josalem commented Dec 15, 2020 •

edited

Loading

josalem commented Dec 16, 2020

josalem commented Dec 16, 2020

sywhang commented Dec 16, 2020

lateralusX commented Dec 16, 2020

lateralusX commented Dec 16, 2020

lateralusX commented Dec 16, 2020

lateralusX commented Dec 16, 2020

lateralusX commented Dec 16, 2020 •

edited

Loading

lateralusX commented Dec 16, 2020 •

edited

Loading

josalem commented Dec 16, 2020

lateralusX commented Dec 16, 2020

lateralusX commented Dec 16, 2020 •

edited

Loading

lateralusX commented Dec 17, 2020

josalem commented Jan 30, 2021

Switch to C implementation of EventPipe #46079

Switch to C implementation of EventPipe #46079

Comments

josalem commented Dec 15, 2020 • edited Loading

jkotas commented Dec 15, 2020

josalem commented Dec 15, 2020 • edited Loading

am11 commented Dec 15, 2020

josalem commented Dec 15, 2020

josalem commented Dec 15, 2020 • edited Loading

josalem commented Dec 16, 2020

josalem commented Dec 16, 2020

sywhang commented Dec 16, 2020

lateralusX commented Dec 16, 2020

lateralusX commented Dec 16, 2020

lateralusX commented Dec 16, 2020

lateralusX commented Dec 16, 2020

lateralusX commented Dec 16, 2020 • edited Loading

lateralusX commented Dec 16, 2020 • edited Loading

josalem commented Dec 16, 2020

lateralusX commented Dec 16, 2020

lateralusX commented Dec 16, 2020 • edited Loading

lateralusX commented Dec 17, 2020

josalem commented Jan 30, 2021

josalem commented Dec 15, 2020 •

edited

Loading

josalem commented Dec 15, 2020 •

edited

Loading

josalem commented Dec 15, 2020 •

edited

Loading

lateralusX commented Dec 16, 2020 •

edited

Loading

lateralusX commented Dec 16, 2020 •

edited

Loading

lateralusX commented Dec 16, 2020 •

edited

Loading