profiler: avoid metrics profile log noise when stopping profiling #2865

nsrip-dd · 2024-09-13T14:18:00Z

What does this PR do?

The metrics profiler insisted on at least one second between collections
for two reasons:

- To avoid a division by zero: it converted a time.Duration to seconds
  with integer division, which truncates sub-second durations to 0, and
  then used the result as the divisor of a ratio in a subsequent
  computation
- In case "a system clock issue causes time to run backwards"

The profiler would report an error if less than one second elapsed
between collections. In practice, this resulted in misleading error logs
because it's entirely likely for profiling to be stopped less than a
second after the last profile collection.

The restriction was not really even needed. For one, we can just do
floating-point division rather than integer division to avoid the
truncation problem.

Also, Go has had monotonic time support by default since Go 1.9
(released in 2017), and time comparison operations, including
time.Time.Sub, work with respect to monotonic time. We shouldn't have
any issues with negative periods. We can still check that the period is
positive as a defensive measure, and fail if it's negative, since that
may indicate a bug in the Go runtime violating its monotonicity
guarantees.

Motivation

Reduce log noise. This log has been misleading in past escalations, with
users/support assuming the error was relevant to the actual issue being
investigated.

Fixes #2863

pr-commenter · 2024-09-13T14:50:52Z

Benchmarks

Benchmark execution time: 2024-09-24 15:47:06

Comparing candidate commit 9bd35fb in PR branch nick.ripley/fix-metrics-profile-error with baseline commit 7699f9e in branch main.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 57 metrics, 1 unstable metrics.

scenario:BenchmarkInjectW3C-24

🟩 execution_time [-171.150ns; -140.050ns] or [-4.095%; -3.351%]


felixge

LGTM. Thanks for doing this 🙇. @pmbauer might have more historical context on this IIRC.

What's up with the perf regressions that are being reported, seems like false positives? 🤔

nsrip-dd · 2024-09-24T15:49:09Z

> What's up with the perf regressions that are being reported, seems like false positives? 🤔

Yeah, my wild guess is it's alignment or some other small perturbation. After pulling in main again the "regression" went away and we even have an "improvement". 🤷

The Windows system timer resolution is about 15 milliseconds (see https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy). This caused the metrics profile tests from #2865 to fail because the metrics profiler will likely be stopped in less than 15 milliseconds, meaning we'll see a duration of 0 between profile collections and log an error.

Drop TestMetricsProfileStopEarlyNoLog because it's not going to be useful if the timer resolution is that low. Bump the period in TestShortMetricsProfile from 10ms to 20ms so that the Windows timer will (hopefully) be able to actually measure the duration.

The fix in #2865 was intended to suppress spurious metrics profile errors when the profiler is stopped. It did so by relaxing the one-second duration constraint of the metrics profiler. However, the Windows system timer resolution is about 15 milliseconds (see https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy). This caused the metrics profile tests from #2865 to fail because the metrics profiler will likely be stopped in less than 15 milliseconds, meaning we'll see a duration of 0 between profile collections and log an error.

This commit actually suppresses the error by checking whether the profiler was stopped (meaning interruptibleSleep was interrupted). If so, and if the metrics profiler returned an error, we instead return a sentinel error indicating that profiling was stopped. If we see that error, we just drop the profile and don't log an error. We won't upload the profile anyway. This way, we should only report an error from the metrics profiler if there is _actually_ a problem with the timer.


nsrip-dd added the profiler label Sep 13, 2024

nsrip-dd force-pushed the nick.ripley/fix-metrics-profile-error branch from 6273290 to f5d8369 Compare September 19, 2024 18:28

nsrip-dd marked this pull request as ready for review September 19, 2024 18:39

nsrip-dd requested a review from a team as a code owner September 19, 2024 18:39

felixge approved these changes Sep 24, 2024


Merge branch 'main' into nick.ripley/fix-metrics-profile-error

9bd35fb

nsrip-dd merged commit 101d4da into main Sep 24, 2024
145 checks passed

nsrip-dd deleted the nick.ripley/fix-metrics-profile-error branch September 24, 2024 15:49

nsrip-dd mentioned this pull request Sep 24, 2024

profiler: suppress errors if the profiler is stopped #2886

Merged
