-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
profiler: avoid metrics profile log noise when stopping profiling #2865
Conversation
BenchmarksBenchmark execution time: 2024-09-24 15:47:06 Comparing candidate commit 9bd35fb in PR branch Found 1 performance improvements and 0 performance regressions! Performance is the same for 57 metrics, 1 unstable metrics. scenario:BenchmarkInjectW3C-24
|
The metrics profiler insisted on at least one second between collections for two reasons: - To avoid a division by zero because it was doing integer division to convert a time.Duration to seconds, which will truncate to 0, as a ratio in a subsequent computation - In case "a system clock issue causes time to run backwards" The profiler would report an error if less than one second elapsed between collections. In practice, this resulted in misleading error logs because it's entirely likely for profiling to be stopped less than a second after the last profile collection. The restriction was not really even needed. For one, we can just do floating-point division rather than integer division to avoid the truncation problem. Also, Go has had monotonic time support by default since 2017, added in Go 1.9, and time comparison operations including time.Time.Sub, work with respect to monotonic time. We shouldn't have any issues with negative periods. We can ensure the period is positive just as a defensive measure, and fail if it's negative since this may indicate a bug in the Go runtime if it's violating the monotonicity guarantees. Fixes #2863
6273290
to
f5d8369
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for doing this 🙇. @pmbauer might have more historical context on this IIRC.
What's up with the perf regressions that are being reported, seems like false positives? 🤔
Yeah, my wild guess is it's alignment or some other small perturbation. After pulling in main again the "regression" went away and we even have an "improvement". 🤷 |
The Windows system timer resolution is about 15 milliseconds (see https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy) This caused the metrics profile tests from #2865 to fail because the metrics profiler will likely be stopped in less than 15 milliseconds, meaning we'll see 0 duration between profile collection and log an error. Drop TestMetricsProfileStopEarlyNoLog because it's not going to be useful if the timer resolution is that low. Bump the period in TestShortMetricsProfile from 10ms to 20ms so that the Windows timer will (hopefully) be able to actually measure the duration.
The fix in #2865 was intended to suppress spurious metrics profile errors when the profiler is stopped. It did so by relaxing the one-second duration constraint of the metrics profiler. However, the Windows system timer resolution is about 15 milliseconds (see https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy) This caused the metrics profile tests from #2865 to fail because the metrics profiler will likely be stopped in less than 15 milliseconds, meaning we'll see 0 duration between profile collection and log an error. This commit actually suppresses the error by checking whether the profiler was stopped (meaning interruptibleSleep was interrupted). If so, and if the metrics profiler returned an error, we instead return a sentinel error indicating that profiling was stopped. If we see that error, we just drop the profile and don't log an error. We won't upload the profile anyway. This way, we should only report an error from the metrics profiler if there is _actually_ a problem with the timer.
The fix in #2865 was intended to suppress spurious metrics profile errors when the profiler is stopped. It did so by relaxing the one-second duration constraint of the metrics profiler. However, the Windows system timer resolution is about 15 milliseconds (see https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy) This caused the metrics profile tests from #2865 to fail because the metrics profiler will likely be stopped in less than 15 milliseconds, meaning we'll see 0 duration between profile collection and log an error. This commit actually suppresses the error by checking whether the profiler was stopped (meaning interruptibleSleep was interrupted). If so, and if the metrics profiler returned an error, we instead return a sentinel error indicating that profiling was stopped. If we see that error, we just drop the profile and don't log an error. We won't upload the profile anyway. This way, we should only report an error from the metrics profiler if there is _actually_ a problem with the timer.
The fix in #2865 was intended to suppress spurious metrics profile errors when the profiler is stopped. It did so by relaxing the one-second duration constraint of the metrics profiler. However, the Windows system timer resolution is about 15 milliseconds (see https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy) This caused the metrics profile tests from #2865 to fail because the metrics profiler will likely be stopped in less than 15 milliseconds, meaning we'll see 0 duration between profile collection and log an error. This commit actually suppresses the error by checking whether the profiler was stopped (meaning interruptibleSleep was interrupted). If so, and if the metrics profiler returned an error, we instead return a sentinel error indicating that profiling was stopped. If we see that error, we just drop the profile and don't log an error. We won't upload the profile anyway. This way, we should only report an error from the metrics profiler if there is _actually_ a problem with the timer.
The fix in #2865 was intended to suppress spurious metrics profile errors when the profiler is stopped. It did so by relaxing the one-second duration constraint of the metrics profiler. However, the Windows system timer resolution is about 15 milliseconds (see https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy) This caused the metrics profile tests from #2865 to fail because the metrics profiler will likely be stopped in less than 15 milliseconds, meaning we'll see 0 duration between profile collection and log an error. This commit actually suppresses the error by checking whether the profiler was stopped (meaning interruptibleSleep was interrupted). If so, and if the metrics profiler returned an error, we instead return a sentinel error indicating that profiling was stopped. If we see that error, we just drop the profile and don't log an error. We won't upload the profile anyway. This way, we should only report an error from the metrics profiler if there is _actually_ a problem with the timer.
…taDog#2865) The metrics profiler insisted on at least one second between collections for two reasons: - To avoid a division by zero because it was doing integer division to convert a time.Duration to seconds, which will truncate to 0, as a ratio in a subsequent computation - In case "a system clock issue causes time to run backwards" The profiler would report an error if less than one second elapsed between collections. In practice, this resulted in misleading error logs because it's entirely likely for profiling to be stopped less than a second after the last profile collection. The restriction was not really even needed. For one, we can just do floating-point division rather than integer division to avoid the truncation problem. Also, Go has had monotonic time support by default since 2017, added in Go 1.9, and time comparison operations including time.Time.Sub, work with respect to monotonic time. We shouldn't have any issues with negative periods. We can ensure the period is positive just as a defensive measure, and fail if it's negative since this may indicate a bug in the Go runtime if it's violating the monotonicity guarantees. Fixes DataDog#2863
The fix in DataDog#2865 was intended to suppress spurious metrics profile errors when the profiler is stopped. It did so by relaxing the one-second duration constraint of the metrics profiler. However, the Windows system timer resolution is about 15 milliseconds (see https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy) This caused the metrics profile tests from DataDog#2865 to fail because the metrics profiler will likely be stopped in less than 15 milliseconds, meaning we'll see 0 duration between profile collection and log an error. This commit actually suppresses the error by checking whether the profiler was stopped (meaning interruptibleSleep was interrupted). If so, and if the metrics profiler returned an error, we instead return a sentinel error indicating that profiling was stopped. If we see that error, we just drop the profile and don't log an error. We won't upload the profile anyway. This way, we should only report an error from the metrics profiler if there is _actually_ a problem with the timer.
What does this PR do?
The metrics profiler insisted on at least one second between collections
for two reasons:
convert a time.Duration to seconds, which will truncate to 0, as a
ratio in a subsequent computation
The profiler would report an error if less than one second elapsed
between collections. In practice, this resulted in misleading error logs
because it's entirely likely for profiling to be stopped less than a
second after the last profile collection.
The restriction was not really even needed. For one, we can just do
floating-point division rather than integer division to avoid the
truncation problem.
Also, Go has had monotonic time support by default since 2017, added in
Go 1.9, and time comparison operations including time.Time.Sub, work
with respect to monotonic time. We shouldn't have any issues with
negative periods. We can ensure the period is positive just as a
defensive measure, and fail if it's negative since this may indicate a
bug in the Go runtime if it's violating the monotonicity guarantees.
Motivation
Reduce log noise. This log has been misleading in past escalations, with
users/support assuming the error was relevant to the actual issue being
investigated.
Fixes #2863