Support EventPipe high core count CPU scalability #1412
One thing to consider when investigating this - if in fact there is high contention on …
We had another report of this issue come in. I suspected it would be a case where the buffer was saturated, but looking over the traces, events aren't getting dropped. Best guess is we've got a lock convoy as the number of threads rises, and because this is a spin lock it is very CPU intensive to wait on. We either need to decrease contention on the lock, eliminate it completely using lockless strategies, or replace it with a lock that has bounded spinning.
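For illustration, here is a minimal sketch (in C#, with hypothetical names; the actual EventPipe lock is native code) of what "bounded spinning" means in practice: try to acquire with a limited spin budget, then fall back to blocking so a convoy of waiters stops burning CPU.

```csharp
using System.Threading;

// Hypothetical sketch, not the actual EventPipe lock: acquire with a bounded
// amount of spinning, then fall back to a blocking wait.
sealed class BoundedSpinLock
{
    private readonly object _inner = new object();

    public void Enter(int maxSpins = 200)
    {
        var spinner = new SpinWait();
        for (int i = 0; i < maxSpins; i++)
        {
            if (Monitor.TryEnter(_inner))
                return;                 // acquired on the bounded spinning path
            spinner.SpinOnce();
        }

        // Spinning budget exhausted: block instead of continuing to spin.
        Monitor.Enter(_inner);
    }

    public void Exit() => Monitor.Exit(_inner);
}
```

(Monitor already spins briefly before blocking; the sketch just makes the "spin a little, then block" behavior explicit, in contrast to a lock that spins indefinitely.)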
@noahfalk @brianrob In our setup, we have a companion application that sets up the EventPipe session and retrieves the events from the stream/socket. The companion application configures the EventPipe session to get informational events from the GC, Contention, and Exception keywords. The main application runs in a container with 30 CPUs allocated. When event emission was deactivated, the issue did not occur. We took CPU samples and a lot of them looked like this:
This means that every event emission was impacted by this lock. We measure the time between when an event is emitted and when it is actually processed by the companion application, and we can see that this latency grows over time. I exposed the m_pThreadSessionStateList size and saw that the list was continuously growing: new threads were added but dead threads were never removed from the list. Hope it helps.
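For reference, a companion-side session like the one described can be set up with the Microsoft.Diagnostics.NETCore.Client library; the sketch below is an assumption about the setup (the process ID is a placeholder), not the reporter's actual code. The keyword values are the standard Microsoft-Windows-DotNETRuntime keyword bits.

```csharp
using System.Diagnostics.Tracing;
using Microsoft.Diagnostics.NETCore.Client;

// Hypothetical companion-side setup: start an EventPipe session against the
// target process and request informational GC, Contention, and Exception
// events from the runtime provider.
const long GCKeyword = 0x1;
const long ContentionKeyword = 0x4000;
const long ExceptionKeyword = 0x8000;

int targetPid = 1234; // placeholder: pid of the main application

var providers = new[]
{
    new EventPipeProvider(
        "Microsoft-Windows-DotNETRuntime",
        EventLevel.Informational,
        GCKeyword | ContentionKeyword | ExceptionKeyword)
};

var client = new DiagnosticsClient(targetPid);
using EventPipeSession session = client.StartEventPipeSession(providers);

// session.EventStream is then read on the companion side
// (e.g. with TraceEvent's EventPipeEventSource).
```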
Thanks @gleocadie! The intended operation is that a thread would reuse a thread-local buffer for ~99.9% of events and only in the 0.1% case would it exhaust the local buffer and need to acquire the global lock to get a new block of memory. Your report is part of growing evidence that either 0.1% is still too much contention or we have bad behavior causing the lock to get accessed more frequently than intended.
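To make the intended design concrete, here is a simplified sketch (C#, hypothetical names; the real implementation is native and recycles pooled buffers per session): writes normally land in a thread-local buffer and only fall back to a shared lock when that buffer is exhausted.

```csharp
using System;

// Illustrative sketch only, not the actual EventPipe code: the fast path
// appends to a thread-local buffer with no contention; the slow path takes
// the global lock to get a new buffer. If writers hit the slow path too
// often, the lock becomes a convoy.
static class EventWriterSketch
{
    [ThreadStatic] private static byte[] _localBuffer;
    [ThreadStatic] private static int _offset;

    private static readonly object _globalLock = new object();
    private const int BufferSize = 64 * 1024;

    public static void WriteEvent(ReadOnlySpan<byte> payload)
    {
        // Fast path: the vast majority of writes fit in the current buffer.
        if (_localBuffer != null && _offset + payload.Length <= _localBuffer.Length)
        {
            payload.CopyTo(_localBuffer.AsSpan(_offset));
            _offset += payload.Length;
            return;
        }

        // Slow path: allocate a fresh buffer under the global lock.
        lock (_globalLock)
        {
            _localBuffer = new byte[Math.Max(BufferSize, payload.Length)];
            _offset = 0;
        }
        payload.CopyTo(_localBuffer.AsSpan(_offset));
        _offset += payload.Length;
    }
}
```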
That matches my understanding too. As far as I can tell, the time the lock is held shouldn't vary with the size of that list, so I wasn't expecting it to be the root cause here. However, it's always possible I've overlooked a connection. Even if it isn't the root cause, that growing list could cause other issues and deserves to be fixed.
@noahfalk I agree it's not the root cause and it does deserve to be fixed.
@gleocadie - Yep, I stand corrected! That definitely looks like a spot where the duration the lock is held is proportional to the number of threads on that list : ) Even though that acquisition isn't on the event writing path, it still contends with the lock acquisitions that are. Nice code spelunking!
@noahfalk - I'm currently testing a fix where I filter out the session state associated with dead threads. I was thinking that (after validating the test) I would push my fix upstream as a PR. What do you think?
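As a rough illustration of that idea (C#, with hypothetical names; the actual fix lives in the native EventPipe code), pruning could simply drop list entries whose owning thread has exited:

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical illustration of pruning dead-thread session state so the
// per-session list stops growing without bound.
sealed class ThreadSessionState
{
    public Thread OwningThread { get; set; }
    // ...per-thread buffers, sequence numbers, etc.
}

static class SessionStatePruning
{
    public static void RemoveDeadThreadStates(List<ThreadSessionState> states)
    {
        // Walking this list happens under the session lock, so the longer it
        // grows the longer the lock is held - which is why dead entries matter.
        states.RemoveAll(s => !s.OwningThread.IsAlive);
    }
}
```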
Please do : ) I can't evaluate the merits of a specific implementation without seeing it, of course, but in principle this is the kind of thing we want to fix. Feel free to push a PR before you have validated it as well; just mark it as a draft and note that you are still testing it. FYI, I might be heads down for a few days here, but I'm looping in some other folks on my team who can take a look - @sywhang @davmason @josalem
@noahfalk - Sorry, I have not pushed the PR yet; my current fix is missing a few things (the test was inconclusive) and I'm working on it. I might push it anyway so we can discuss it (maybe next week). I will ping you and your team when it's done.
No worries @gleocadie, we can do it whenever you are ready. After next week I'll be on vacation for the remainder of the year but some of my teammates should still be around.
Thanks, no problem.
Solved via dotnet/runtime#48435.
We had a report from a user on a machine with a 20-core CPU group that a hot lock in EventPipe (EventPipeBufferManager::m_lock) was causing excessive spinning in EventPipeBufferManager::AllocateBufferForThread. There was conflicting evidence about whether the system was sustaining a high event load. Regardless of the specifics of that case, we need to validate that EventPipe scales well.
We should create a performance test that exercises EventPipe from a large number of cores (>= 20), reading with both EventListener and IPC. We should then compare a low-load case (1 event/sec/thread) against a high-load case (maximum sustainable event rate on all threads), looking at the latency distribution of calls to WriteEvent. A first stab at a goal is probably that P50 latency increases no more than 10% and P99 latency increases no more than 50%.
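A minimal sketch of what such a test harness might look like (C#; the event source name, thread count, and pacing are placeholders, and a session such as an EventListener or a dotnet-trace/IPC client must enable the source for events to actually flow through EventPipe):

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using System.Threading;

// Hypothetical harness: each worker thread calls WriteEvent in a loop,
// records per-call latency, and the run reports P50/P99 so a low-load run
// can be compared against a high-load run.
[EventSource(Name = "EventPipe-Scalability-Test")]
sealed class TestEventSource : EventSource
{
    public static readonly TestEventSource Log = new TestEventSource();

    [Event(1)]
    public void Ping(int threadId) => WriteEvent(1, threadId);
}

static class WriteEventLatencyTest
{
    // Returns (P50, P99) WriteEvent latency in milliseconds.
    public static (double p50, double p99) Run(int threadCount, int eventsPerThread, int pacingSpins)
    {
        var latencies = new double[threadCount * eventsPerThread];
        var workers = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++)
        {
            int tid = t;
            workers[t] = new Thread(() =>
            {
                var sw = new Stopwatch();
                for (int i = 0; i < eventsPerThread; i++)
                {
                    sw.Restart();
                    TestEventSource.Log.Ping(tid);
                    sw.Stop();
                    latencies[tid * eventsPerThread + i] = sw.Elapsed.TotalMilliseconds;

                    if (pacingSpins > 0)
                        Thread.SpinWait(pacingSpins);   // throttle for the low-load case
                }
            });
            workers[t].Start();
        }

        foreach (var w in workers) w.Join();

        Array.Sort(latencies);
        double Percentile(double q) => latencies[(int)(q * (latencies.Length - 1))];
        return (Percentile(0.50), Percentile(0.99));
    }
}
```

Running `Run` twice, once with pacing (low load) and once without (high load), and comparing the returned percentiles would give the P50/P99 regression numbers described above.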