Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUM-3470 feat: Head-based sampling for local and (automatic) distributed tracing #1783

Conversation

ncreated
Copy link
Member

@ncreated ncreated commented Apr 17, 2024

What and why?

📦 This PR implements head-based sampling for the scenarios defined earlier in #1772. It also adds tests for one more class of scenarios: using distributed trace created manually with OTFormatWriter API.

See all scenarios for head-based sampling vs their support level:

  • ✅ Local trace with explicit parent:
client-ios-app:     [---- custom.span.1 -----]
client-ios-app:       [-- custom.span.2 ---]
client-ios-app:         [- custom.span.3 -]
  • ✅ Local trace with implicit parent:
client-ios-app:     [----- active.span ------]
client-ios-app:       [-- custom.span.1 ---]
client-ios-app:         [- custom.span.2 -]
  • ✅ Distributed trace sent via URLSessionInstrumentation:
dd-sdk-ios:         [--- urlsession.request ---]
client backend:        [--- backend span ---]
  • ✅ Distributed trace sent via URLSessionInstrumentation with implicit parent:
client-ios-app:     [-------- active.span ---------]
dd-sdk-ios:           [--- urlsession.request ---]
client backend:          [--- backend span ---]
  • ❌ Distributed trace sent via Tracer API (*HTTPHeadersWriter):
client-ios-app:     [------ network.span -------]
client backend:         [--- backend span ---]
  • ❌ Distributed trace sent via Tracer API (*HTTPHeadersWriter) with implicit parent:
client-ios-app:     [------ active.span -------]
client-ios-app:       [---- network.span ----]
client backend:         [-- backend span --]

The last 2 scenarios are defined as failing (and disabled) tests. They will be supported and enabled in next PR.

How?

I did few changes that enable this feature.

The sampling decision is now a part of DDSpanContext

In OT, creating child span requires passing the OTSpanContext of the parent. This makes a good place for propagating the sampling decision for head-based sampling. For that reason, sampleRate and isKept are added to DDSpanContext.

DDSpanContext (parent): {traceID, spanID, nil, isKept, sampleRate}

                                   ↓

DDSpanContext (child):  {parent.traceID, spanID, parent.spanID, parent.isKept, parent.sampleRate}

Note: While isKept is crucial for sampling the whole trace equally, the sampleRate is not. We need to propagate it, so it can be later sent in SpanEvent for computing BE-side metrics.

The OTFormatReader implementations can now extract the sampling decision

Because the OTTracer can encode and decode the OTSpanContext:

// OTTracer
func inject(spanContext: OTSpanContext, writer: OTFormatWriter)
func extract(reader: OTFormatReader) -> OTSpanContext?

the implementations of HTTP header readers (DD, W3C and B3) were extended to extract the isKept and sampleRate.

Note: The sampleRate is not encoded as header so it needs to be supplied by the tracer in extract(reader:).

NetworkInstrumentationFeature no longer decodes TraceContext from request headers

I enhanced the implementation of passing TraceContext between task interception methods. It now avoids coding it through request headers, which enables passing extra information not available in standard headers (the sampleRate).

Before:

// NetworkInstrumentationFeature
let request = self.intercept(request: currentRequest, ...) // <-- encode TraceContext in request headers
self.intercept(task: task, ...) // <-- decode TraceContext from task.currentRequest headers

Now:

let (request, traceContexts) = self.intercept(request: currentRequest, ...)
self.intercept(task: task, with: traceContexts, ...)

Note: This simplifies and enhances task interception in the main flow. However, it required additional changes in URLSessionInterceptor which is required for Alamofire Extension.

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference
  • Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

  • Run unit tests for Core, RUM, Trace, Logs, CR and WVT
  • Run unit tests for Session Replay
  • Run integration tests
  • Run smoke tests
  • Run tests for tools/

@ncreated ncreated self-assigned this Apr 17, 2024
Comment on lines -731 to +734
let spanContext1 = DDSpanContext(traceID: .init(idHi: 10, idLo: 100), spanID: 200, parentSpanID: .mockAny(), baggageItems: .mockAny())
let spanContext2 = DDSpanContext(traceID: 3, spanID: 4, parentSpanID: .mockAny(), baggageItems: .mockAny())
let spanContext1 = DDSpanContext(traceID: .init(idHi: 10, idLo: 100), spanID: 200, parentSpanID: .mockAny(), baggageItems: .mockAny(), sampleRate: .mockRandom(), isKept: .mockRandom())
let spanContext2 = DDSpanContext(traceID: 3, spanID: 4, parentSpanID: .mockAny(), baggageItems: .mockAny(), sampleRate: .mockRandom(), isKept: .mockRandom())

let httpHeadersWriter = HTTPHeadersWriter(sampler: .mockKeepAll())
let httpHeadersWriter = HTTPHeadersWriter(sampler: Sampler.mockKeepAll())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This surfaces the problem of head-based sampling in manual distributed tracing. No matter of the isKept decision made by tracer for spanContext, distributed trace headers are sampled exclusively using the sampler configured in *HTTPHeadersWriter.

I will work on this in "part 2" of this PR.

@datadog-datadog-prod-us1
Copy link

datadog-datadog-prod-us1 bot commented Apr 17, 2024

Datadog Report

Branch report: ncreated/RUM-3470/head-based-sampling-part1
Commit report: c0f09b2
Test service: dd-sdk-ios

✅ 0 Failed, 3004 Passed, 0 Skipped, 12m 19.28s Wall Time
🔻 Test Sessions change in coverage: 13 decreased, 1 no change

🔻 Code Coverage Decreases vs Default Branch (13)

This report shows up to 5 code coverage decreases.

  • test DatadogInternalTests iOS 79.64% (-0.52%) - Details
  • test DatadogInternalTests tvOS 79.67% (-0.47%) - Details
  • test DatadogTraceTests iOS 49.15% (-0.31%) - Details
  • test DatadogRUMTests tvOS 80.73% (-0.3%) - Details
  • test DatadogLogsTests tvOS 45.18% (-0.19%) - Details

@ncreated ncreated force-pushed the ncreated/RUM-3470/head-based-sampling-part1 branch from b552be4 to 39341f0 Compare April 17, 2024 10:52
Comment on lines +121 to +122
sampleRate: parentSpanContext?.sampleRate ?? sampler.samplingRate,
isKept: parentSpanContext?.isKept ?? sampler.sample()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 This is the main point for head-based sampling implementation in this PR:

  • use the sampling decision from the parent span (if available)
  • or determine it based on provided sampler (which can be local or distributed trace sampler).

@ncreated ncreated force-pushed the ncreated/RUM-3470/head-based-sampling-part1 branch from 39341f0 to 8a89706 Compare April 17, 2024 11:39
@ncreated ncreated marked this pull request as ready for review April 17, 2024 12:59
@ncreated ncreated requested review from a team as code owners April 17, 2024 12:59
Identifier = "HeadBasedSamplingTests/testSamplingLocalTrace()">
Identifier = "HeadBasedSamplingTests/testSendingDroppedDistributedTraceWithNoParent_throughTracerAPI()">
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Some tests were renamed, some "red" ones were enabled (now "green") and some new "red" tests were added (disabled) to be fixed in next PR.

Comment on lines -22 to -23
/// Returns the trace of the current execution context.
func traceContext() -> TraceContext?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 No longer needed. It was actually doubling the effort:

  • active span was obtained once in modify(request:) of Trace handler,
  • obtained second time in intercept(task:) in NetworkInstrumentationFeature,
  • it was no-op in RUM handler.

Now we read it only once, in Trace handler and pass the parentSpanID in TraceContext.

@@ -6,6 +6,8 @@

import Foundation

// TODO: RUM-3470 Add tests to `URLSessionInterceptor`
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 URLSessionInterception has no tests coverage. I will add tests in next PR. Here, we're merging against feature branch.

Comment on lines -471 to -472
func testGivenOpenTracing_whenInterceptingRequests_itInjectsTrace() throws {
// Given
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 All these tests are not necessary because we removed traceContext() in DatadogURLSessionHandler.

Comment on lines -261 to -262
func testTraceContext_whenInterceptionStarts_withActiveSpan_itReturnCurrentSpan() {
// When
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Not necessary with the removal of traceContext() from DatadogURLSessionHandler.

Copy link
Member

@maxep maxep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall 👍 but I left a blocking point that we can discuss first.

@ncreated ncreated requested a review from maxep April 19, 2024 09:21
Copy link
Contributor

@ganeshnj ganeshnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good nothing block.

@@ -58,7 +58,7 @@ public class HTTPHeadersWriter: TracePropagationHeadersWriter {
/// Initializes the headers writer.
///
/// - Parameter sampler: The sampler used for headers injection.
public init(sampler: Sampler) {
public init(sampler: Sampling) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


This is a breaking change

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically yes, but the Sampler type is only defined in DatadogInternal without being re-exported from user-facing modules. In practise, the only supported public initializer for HTTPHeadersWriter is init(sampleRate:), which is not altered by this change.

@@ -21,3 +30,25 @@ public struct Sampler {
return Float.random(in: 0.0..<100.0) < samplingRate
}
}

/// A sampler that determines sampling decisions deterministically (the same each time).
public struct DeterministicSampler: Sampling {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for inspiration, I think we can multiple samplers like here https://github.com/open-telemetry/opentelemetry-swift/blob/4eb75bc8f0d6449bc197637b5c6044b51ab8c4ed/Sources/OpenTelemetrySdk/Trace/Samplers/Samplers.swift#L10

I'm particularly interested in ParentBasedSampler

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ganeshnj I'm sure what is the actual ask here 🙂, could you clarify? The SDK doesn't expose any sampler implementation to the end-user (interface + impls exist only in DatadogInternal). Because end-users only work with sampleRate: Float API today, I don't see reason for introducing more samplers than we need for SDK internals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apologies, it wasn't clear. I mean to say - for our internal working we can use some existing pattern. I wasn't expecting any action on the comment. Just for information if we have tricky cases around sampling decision.

@ganeshnj
Copy link
Contributor

✅ Local trace with implicit parent:

client-ios-app:     [----- active.span ------]
client-ios-app:       [-- custom.span.1 ---]
client-ios-app:         [- custom.span.2 -]

This must be rather from parent-child point of view.

client-ios-app:     [----- active.span ------]
client-ios-app:       [-- custom.span.1 ---]
client-ios-app:       [-- custom.span.2 ---]

@ncreated
Copy link
Member Author

✅ Local trace with implicit parent:

client-ios-app:     [----- active.span ------]
client-ios-app:       [-- custom.span.1 ---]
client-ios-app:         [- custom.span.2 -]

This must be rather from parent-child point of view.

client-ios-app:     [----- active.span ------]
client-ios-app:       [-- custom.span.1 ---]
client-ios-app:       [-- custom.span.2 ---]

It already is:

 client-ios-app:     [-------- active.span -----]   |
 client-ios-app:       [- child1 -][- child2 -]     | all 3: keep or drop

^ one parent with two childs

@ncreated ncreated requested a review from ganeshnj April 22, 2024 07:51
Copy link
Member

@maxep maxep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done!

@ncreated ncreated merged commit c452a2d into ncreated/RUM-3470/head-based-sampling Apr 22, 2024
8 checks passed
@ncreated ncreated deleted the ncreated/RUM-3470/head-based-sampling-part1 branch April 22, 2024 14:59
@maciejburda maciejburda mentioned this pull request May 7, 2024
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants