Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

profiler: avoid metrics profile log noise when stopping profiling #2865

Merged
merged 2 commits into from
Sep 24, 2024

Conversation

nsrip-dd
Copy link
Contributor

@nsrip-dd nsrip-dd commented Sep 13, 2024

What does this PR do?

The metrics profiler insisted on at least one second between collections
for two reasons:

  • To avoid a division by zero because it was doing integer division to
    convert a time.Duration to seconds, which will truncate to 0, as a
    ratio in a subsequent computation
  • In case "a system clock issue causes time to run backwards"

The profiler would report an error if less than one second elapsed
between collections. In practice, this resulted in misleading error logs
because it's entirely likely for profiling to be stopped less than a
second after the last profile collection.

The restriction was not really even needed. For one, we can just do
floating-point division rather than integer division to avoid the
truncation problem.

Also, Go has had monotonic time support by default since 2017, added in
Go 1.9, and time comparison operations including time.Time.Sub, work
with respect to monotonic time. We shouldn't have any issues with
negative periods. We can ensure the period is positive just as a
defensive measure, and fail if it's negative since this may indicate a
bug in the Go runtime if it's violating the monotonicity guarantees.

Motivation

Reduce log noise. This log has been misleading in past escalations, with
users/support assuming the error was relevant to the actual issue being
investigated.

Fixes #2863

@pr-commenter
Copy link

pr-commenter bot commented Sep 13, 2024

Benchmarks

Benchmark execution time: 2024-09-24 15:47:06

Comparing candidate commit 9bd35fb in PR branch nick.ripley/fix-metrics-profile-error with baseline commit 7699f9e in branch main.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 57 metrics, 1 unstable metrics.

scenario:BenchmarkInjectW3C-24

  • 🟩 execution_time [-171.150ns; -140.050ns] or [-4.095%; -3.351%]

The metrics profiler insisted on at least one second between collections
for two reasons:

- To avoid a division by zero because it was doing integer division to
  convert a time.Duration to seconds, which will truncate to 0, as a
  ratio in a subsequent computation
- In case "a system clock issue causes time to run backwards"

The profiler would report an error if less than one second elapsed
between collections. In practice, this resulted in misleading error logs
because it's entirely likely for profiling to be stopped less than a
second after the last profile collection.

The restriction was not really even needed. For one, we can just do
floating-point division rather than integer division to avoid the
truncation problem.

Also, Go has had monotonic time support by default since 2017, added in
Go 1.9, and time comparison operations including time.Time.Sub, work
with respect to monotonic time. We shouldn't have any issues with
negative periods. We can ensure the period is positive just as a
defensive measure, and fail if it's negative since this may indicate a
bug in the Go runtime if it's violating the monotonicity guarantees.

Fixes #2863
@nsrip-dd nsrip-dd force-pushed the nick.ripley/fix-metrics-profile-error branch from 6273290 to f5d8369 Compare September 19, 2024 18:28
@nsrip-dd nsrip-dd marked this pull request as ready for review September 19, 2024 18:39
@nsrip-dd nsrip-dd requested a review from a team as a code owner September 19, 2024 18:39
Copy link
Member

@felixge felixge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for doing this 🙇. @pmbauer might have more historical context on this IIRC.

What's up with the perf regressions that are being reported, seems like false positives? 🤔

@nsrip-dd
Copy link
Contributor Author

What's up with the perf regressions that are being reported, seems like false positives? 🤔

Yeah, my wild guess is it's alignment or some other small perturbation. After pulling in main again the "regression" went away and we even have an "improvement". 🤷

@nsrip-dd nsrip-dd merged commit 101d4da into main Sep 24, 2024
145 checks passed
@nsrip-dd nsrip-dd deleted the nick.ripley/fix-metrics-profile-error branch September 24, 2024 15:49
nsrip-dd added a commit that referenced this pull request Sep 24, 2024
The Windows system timer resolution is about 15 milliseconds (see
https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy)
This caused the metrics profile tests from #2865 to fail because the
metrics profiler will likely be stopped in less than 15 milliseconds,
meaning we'll see 0 duration between profile collection and log an
error. Drop TestMetricsProfileStopEarlyNoLog because it's not going to
be useful if the timer resolution is that low. Bump the period in
TestShortMetricsProfile from 10ms to 20ms so that the Windows timer will
(hopefully) be able to actually measure the duration.
nsrip-dd added a commit that referenced this pull request Sep 24, 2024
The fix in #2865 was intended to suppress spurious metrics profile
errors when the profiler is stopped. It did so by relaxing the
one-second duration constraint of the metrics profiler. However,
the Windows system timer resolution is about 15 milliseconds (see
https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy)
This caused the metrics profile tests from #2865 to fail because the
metrics profiler will likely be stopped in less than 15 milliseconds,
meaning we'll see 0 duration between profile collection and log an
error.

This commit actually suppresses the error by checking whether the
profiler was stopped (meaning interruptibleSleep was interrupted). If
so, and if the metrics profiler returned an error, we instead return a
sentinel error indicating that profiling was stopped. If we see that
error, we just drop the profile and don't log an error. We won't upload
the profile anyway. This way, we should only report an error from the
metrics profiler if there is _actually_ a problem with the timer.
nsrip-dd added a commit that referenced this pull request Sep 24, 2024
The fix in #2865 was intended to suppress spurious metrics profile
errors when the profiler is stopped. It did so by relaxing the
one-second duration constraint of the metrics profiler. However,
the Windows system timer resolution is about 15 milliseconds (see
https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy)
This caused the metrics profile tests from #2865 to fail because the
metrics profiler will likely be stopped in less than 15 milliseconds,
meaning we'll see 0 duration between profile collection and log an
error.

This commit actually suppresses the error by checking whether the
profiler was stopped (meaning interruptibleSleep was interrupted). If
so, and if the metrics profiler returned an error, we instead return a
sentinel error indicating that profiling was stopped. If we see that
error, we just drop the profile and don't log an error. We won't upload
the profile anyway. This way, we should only report an error from the
metrics profiler if there is _actually_ a problem with the timer.
nsrip-dd added a commit that referenced this pull request Sep 24, 2024
The fix in #2865 was intended to suppress spurious metrics profile
errors when the profiler is stopped. It did so by relaxing the
one-second duration constraint of the metrics profiler. However,
the Windows system timer resolution is about 15 milliseconds (see
https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy)
This caused the metrics profile tests from #2865 to fail because the
metrics profiler will likely be stopped in less than 15 milliseconds,
meaning we'll see 0 duration between profile collection and log an
error.

This commit actually suppresses the error by checking whether the
profiler was stopped (meaning interruptibleSleep was interrupted). If
so, and if the metrics profiler returned an error, we instead return a
sentinel error indicating that profiling was stopped. If we see that
error, we just drop the profile and don't log an error. We won't upload
the profile anyway. This way, we should only report an error from the
metrics profiler if there is _actually_ a problem with the timer.
nsrip-dd added a commit that referenced this pull request Sep 25, 2024
The fix in #2865 was intended to suppress spurious metrics profile
errors when the profiler is stopped. It did so by relaxing the
one-second duration constraint of the metrics profiler. However,
the Windows system timer resolution is about 15 milliseconds (see
https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy)
This caused the metrics profile tests from #2865 to fail because the
metrics profiler will likely be stopped in less than 15 milliseconds,
meaning we'll see 0 duration between profile collection and log an
error.

This commit actually suppresses the error by checking whether the
profiler was stopped (meaning interruptibleSleep was interrupted). If
so, and if the metrics profiler returned an error, we instead return a
sentinel error indicating that profiling was stopped. If we see that
error, we just drop the profile and don't log an error. We won't upload
the profile anyway. This way, we should only report an error from the
metrics profiler if there is _actually_ a problem with the timer.
MNThomson pushed a commit to agilebits/dd-trace-go that referenced this pull request Oct 21, 2024
…taDog#2865)

The metrics profiler insisted on at least one second between collections
for two reasons:

- To avoid a division by zero because it was doing integer division to
  convert a time.Duration to seconds, which will truncate to 0, as a
  ratio in a subsequent computation
- In case "a system clock issue causes time to run backwards"

The profiler would report an error if less than one second elapsed
between collections. In practice, this resulted in misleading error logs
because it's entirely likely for profiling to be stopped less than a
second after the last profile collection.

The restriction was not really even needed. For one, we can just do
floating-point division rather than integer division to avoid the
truncation problem.

Also, Go has had monotonic time support by default since 2017, added in
Go 1.9, and time comparison operations including time.Time.Sub, work
with respect to monotonic time. We shouldn't have any issues with
negative periods. We can ensure the period is positive just as a
defensive measure, and fail if it's negative since this may indicate a
bug in the Go runtime if it's violating the monotonicity guarantees.

Fixes DataDog#2863
MNThomson pushed a commit to agilebits/dd-trace-go that referenced this pull request Oct 21, 2024
The fix in DataDog#2865 was intended to suppress spurious metrics profile
errors when the profiler is stopped. It did so by relaxing the
one-second duration constraint of the metrics profiler. However,
the Windows system timer resolution is about 15 milliseconds (see
https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/high-resolution-timers#controlling-timer-accuracy)
This caused the metrics profile tests from DataDog#2865 to fail because the
metrics profiler will likely be stopped in less than 15 milliseconds,
meaning we'll see 0 duration between profile collection and log an
error.

This commit actually suppresses the error by checking whether the
profiler was stopped (meaning interruptibleSleep was interrupted). If
so, and if the metrics profiler returned an error, we instead return a
sentinel error indicating that profiling was stopped. If we see that
error, we just drop the profile and don't log an error. We won't upload
the profile anyway. This way, we should only report an error from the
metrics profiler if there is _actually_ a problem with the timer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Error getting metrics profile: period between metrics collection is too small
2 participants