ActivityId missing in a newly added HttpTelemetry Stop events. #40776

ManickaP · 2020-08-13T14:42:09Z

In non-trivial number of cases, the RequestStop event has empty activity id.

9 945 out of 35 452 requests sent during HTTP FunctionalTests were missing activity id.

Other Start/Stop event pairs introduced with the new telemetry are not properly linked with ActivityId as well.

Also easily reproducible with a simple HttpClient scenario:

class Program
{
    static async Task Main(string[] args)
    {
        using var listener = new HttpEventListener();
        using var client = new HttpClient();
        
        await client.GetAsync(@"https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Http/src/HttpDiagnosticsGuide.md");
    }
}

internal sealed class HttpEventListener : EventListener
{
    public const EventKeywords TasksFlowActivityIds = (EventKeywords)0x80;
    //public const EventKeywords Debug = (EventKeywords)0x20000;

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "System.Net.Http" ||
            eventSource.Name == "System.Net.Sockets" ||
            eventSource.Name == "System.Net.Security" ||
            eventSource.Name == "System.Net.NameResolution")
        {
            EnableEvents(eventSource, EventLevel.LogAlways);
        }
        else if (eventSource.Name == "System.Threading.Tasks.TplEventSource")
        {
            EnableEvents(eventSource, EventLevel.LogAlways, TasksFlowActivityIds /*| Debug*/);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var sb = new StringBuilder().Append($"{eventData.TimeStamp:HH:mm:ss.fffffff}[{eventData.EventName}]{eventData.ActivityId}.{eventData.RelatedActivityId} ");
        for (int i = 0; i < eventData.Payload?.Count; i++)
        {
            if (i > 0)
                sb.Append(", ");
            sb.Append(eventData.PayloadNames?[i]).Append(": ").Append(eventData.Payload[i]);
        }
        Console.WriteLine(sb.ToString());
    }
}

Result (with the change in #41022)

09:34:45.3129327[RequestStart]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000 scheme: https, host: github.com, port: 443, pathAndQuery: /dotnet/runtime/blob/master/src/libraries/System.Net.Http/src/HttpDiagnosticsGuide.md, versionMajor: 1, versionMinor: 1
09:34:45.4344957[ResolutionStart]00001011-0000-0000-0000-0000090f9d59.00000011-0000-0000-0000-0000093f9d59 hostNameOrAddress: github.com
09:34:45.4752982[ResolutionStop]00001011-0000-0000-0000-0000090f9d59.00000000-0000-0000-0000-000000000000 
09:34:45.4848747[ConnectStart]00002011-0000-0000-0000-0000091f9d59.00000011-0000-0000-0000-0000093f9d59 address: InterNetworkV6:28:{1,187,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255,255,140,82,118,3,0,0,0,0}
09:34:45.5208470[ConnectStop]00002011-0000-0000-0000-0000091f9d59.00000000-0000-0000-0000-000000000000 
09:34:45.5347626[HandshakeStart]00003011-0000-0000-0000-0000096f9d59.00000011-0000-0000-0000-0000093f9d59 isServer: False, targetHost: github.com
09:34:45.8030641[HandshakeStop]00003011-0000-0000-0000-0000096f9d59.00000000-0000-0000-0000-000000000000 protocol: Tls13
09:34:45.8070942[Http11ConnectionEstablished]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000 
09:34:46.2040073[ResponseHeadersBegin]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000 
09:34:46.3278444[RequestStop]00000011-0000-0000-0000-0000093f9d59.00000000-0000-0000-0000-000000000000

~~Un-paired Stop events:~~
~~- ResolutionStop: ran in continuation: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.NameResolution/src/System/Net/Dns.cs#L485~~
~~- ConnectStop: ran in SocketsEventArgs: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs#L744~~
~~- HandshakeStop: ran in continuation: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs#L240~~
They're paired, I misinterpreted the logs. The Stop event doesn't carry parent activity id, only the start one.

What needs to be done to achieve desired statistics:

We can also add or leave existing unpaired telemetry (though it shouldn't be named Start/Stop then) for events that cannot be bound by a single async context. We can link them to the Start/Stop events manually either via remembering ActivityId or using some other custom ID.

cc: @samsp-msft @MihaZupan @karelz

The text was updated successfully, but these errors were encountered:

ghost · 2020-08-13T14:42:14Z

Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

ghost · 2020-08-13T14:42:27Z

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

ManickaP · 2020-08-14T08:22:54Z

I might have found the root cause of this problem.
When I fetch an HTTP response without content, the ActivityId is present in the Stop event.

The RequestStop is called only after the response has been read to the end, which happens outside of the async scope started when RequestStart was called.

Details:

RequestStart is called in HttpConnectionPoolManager, several levels down from HttpClient.SendAsync.
RequestStop is called in HttpRequestMessage called when the response content stream has been read: e.g.: ContentLengthReadStream. Which is either called by the user, or in case of response buffering, in HttpClient.FinishSendAsync.

Several options how to solve it:

track only request part of request / response exchange, i.e. call stop sooner.
- maybe add separate tracking for response (no idea if it's doable)
manually handle activity id and pass it to the stop event (no idea if it's doable)
keep it as it is, pairing won't work
- manually add custom identifier into the logging so at least something can be used

noahfalk · 2020-08-14T08:52:10Z

@ManickaP nice investigating 👍
Indeed as you observed, ActivityId is async-local so it flows from parent to child when creating a Task but changes do not flow back from child to parent when an await completes.

I'm not knowledgeable about the networking code, but is there an option to call Start() sooner so that it runs in the same Task where Stop() will eventually be called? Other more customized solutions may also be possible, but I've never attempted to author one and my suspicion is it would get messy.

ManickaP · 2020-08-14T10:02:41Z

is there an option to call Start() sooner so that it runs in the same Task where Stop() will eventually be called?

We could, theoretically, run it from HttpClient.SendAsync, but that would only help us with buffered responses. In the other case, the request gets finished after the user either disposes it or reads the response stream till the end and we have no control over when and where this happens.

Is there a way to manually retrieve and later attach the ActivityId to the stop event?

MihaZupan · 2020-08-14T11:11:35Z

As @ManickaP said, we could only do that for buffered responses. For unbuffered ones, we hand out the response that the user can interact with that will eventaully trigger a Stop event.

If I'm understanding this correctly, we can't flow the ActivityID up to the parent?
In that case, is the HttpClient design (handing out the response to the caller to process) incompatible with a Start/Stop activity tracking the entire processing lifetime?

If we give up on tracking the response portion, we would cut into the value of the activity. At that point we would be telling users to log their own Start/Stop activity enclosing the processing of a request.

Alternatively, would introducing a "SendAndProcessWithTelemetryAsync" method that takes a callback make any sense? In that can we could ensure that everything happens in the same scope.

Something along the lines of (just a rough idea):

// On HttpClient
Task<HttpResponseMessage> SendAndProcessWithTelemetryAsync(HttpRequestMessage request, HttpCompletionOption completionOption, Func<HttpResponseMessage, Task> responseCallback, CancellationToken cancellationToken)
{
    TelemetryStart();
    try
    {
        HttpResponseMessage response = await this.SendAsync(request, completionOption, cancellationToken);
        await responseCallback(response);
        TelemetryStop();
        return response;
    }
    catch
    {
        TelemetryAbort(); // Which also logs Stop
        throw;
    }
}

ManickaP · 2020-08-14T11:20:40Z

Alternatively, would introducing a "SendAndProcessWithTelemetryAsync" method that takes a callback make any sense? In that can we could ensure that everything happens in the same scope.

That would measure/trace different things depending of HttpCompletionOptions, for buffered it'd include reading the response, for read-headers it wouldn't.
It's similar the first solution I suggested in here: #40776 (comment).

MihaZupan · 2020-08-14T11:26:08Z

for read-headers it wouldn't.

The idea is that it would be captured in the user-provided callback.

Essentially if a user had code like

Task DoSomethingWithResponseAsync(HttpResponseMessage response) {}

var response = await client.SendAsync(request, completionOption);
await DoSomethingWithResponseAsync(response);

it would turn into

await client.SendAndProcessWithTelemetryAsync(request, completionOption, DoSomethingWithResponseAsync);

That way the Telemetry Start/Stop would be in the enclosing scope of the response processing.

ManickaP · 2020-08-14T11:29:36Z

Now I see.

KalleOlaviNiemitalo · 2020-08-14T11:44:45Z

Is there a way to manually retrieve and later attach the ActivityId to the stop event?

One could save EventSource.CurrentThreadActivityId after the start event, use EventSource.SetCurrentThreadActivityId to temporarily change the activity ID before and after the stop event, and set EventActivityOptions.Disable on the stop event to prevent ActivityTracker from overriding the activity ID. However, if the request causes any other activities, then those would not get the correct parent in their hierarchical GUIDs, so the async solution seems safer.

Related: dotnet/diagnostics#1347

stephentoub · 2020-08-14T12:17:58Z

Alternatively, would introducing a "SendAndProcessWithTelemetryAsync" method that takes a callback make any sense?

Good question, but no :-) We do not want to encourage that style of programming, and it won't fix the 99% case (on top of which it's more expensive).

noahfalk · 2020-08-14T12:25:38Z

One could save EventSource.CurrentThreadActivityId after the start event ...

What @KalleOlaviNiemitalo suggests would work to set the activity id of the stop event, however there is also the consideration of what activity id is attached to all events that happen to occur in between the start and stop. Perhaps we are in position to know that we can control all of those events because we own 100% of the code that executes between those two points? I didn't get a chance to read all the updates above and it is almost 5:30am so I need to sleep before I'll be able to do a good job thinking about this. When I'm awake again I'll dig into it further : )

ManickaP · 2020-08-14T12:37:12Z

We can move the RequestStart somewhere within the start of HttpClient.SendAsync. This should put all of our other activities in the same async scope. After that, the control is in user's hands, we don't need to track those activities, we just need to link the RequestStop once it happens. For that, we can use the SetCurrentThreadActivityId. The only things that would be excluded are content reading methods (LoadIntoBuffer, SerializeToStream, CreateContentReadStream, etc.), which, AFAIK, don't have any activities, so we should be fine.

All of this is based on premise, that tracing the request until the whole response in consumed is what we want. Alternatively, we could narrow the scope only to the point of creating the response and exclude the response content reading.

samsp-msft · 2020-08-17T21:12:57Z

If the response is unbuffered, then we should make the stop be when we hand back control to the user code, as its not ultimately us in control.
If we want to make start/stop events for this that can be correlated, we could create a new event with its own activity ID, that is not parented (so as not to break the async generated IDs) and then reference either it or the other ID in one of other of the payloads, so somebody can track it if they need to. I would put this uber-event at a lower priority for being completed. It may not make sense to be an activity ID, just some other ID that can be used to correlate.

noahfalk · 2020-08-17T22:22:05Z

In that case, is the HttpClient design (handing out the response to the caller to process) incompatible with a Start/Stop activity tracking the entire processing lifetime?

Yes (at least in the cases where the caller will still need to download the remainder of the response after SendAsync task completes). Apologies for not recognizing the potential for this issue earlier.

The main premises of Activities are:

Contiguous time periods of work on a logical thread (an async Task and any Task children). This requires them to have a start time, an end time, and an implied duration between.
They nest. When a logical thread starts working on a new activity, the new activity needs to finish before work resumes on the parent.

Even if we are capable creating events that don't adhere to these rules, the tools doing analysis are likely to struggle because we expect Activities form a tree. So a generic pattern that is hard for the library to represent with a single Activity looks like this:

class Worker
{
    Task DoWorkPart1(...); // in our case this is HttpClient.SendAsync()
    Task DoWorkPart2(...); // and Stream.ReadAsync() on the returned stream
}

A user can consume that API and do very not-treeish things with it:

List<string> workList = new List<string>();
foreach(string work in workList) { await Worker.DoWorkPart1(work); }
foreach(string work in workList) { await Worker.DoWorkPart2(work); }

The individual parts of the work can be sequenced/nested, but if part1 and part2 are combined into a single activity they don't nest:

time ---->

| work 1.1 | work 2.1 | work 3.1 |  work 1.2 | work 2.2 | work 3.2 |
< -------------- work 1 --------------------->
            <------------------ work 2 ----------------->
                       < ------------------ work 3 ---------------->

However, there are still options we could look at to improve the situation, now and/or in the future:

Create additional (optional) Activities with start/stop events when the response stream is read. We could also create a correlation id so that analysis tools can connect those reads back to the SendAsync which initiated the request. Stream.GetHashCode() might be unique enough to use? (this mirrors @samsp-msft suggestion above)
Collaborate with Visual Studio profiler and/or perfview to add logic correlating the different chunks of work and improve analysis/UI. This type of IO analysis seems likely to be useful so if there isn't already some existing support I can advocate that we build it.
Scope down the activity created by SendAsync so that it sends the stop event at the end of SendAsync.
Create single activity scopes for situations where the library can enforce that all the work happens contiguously, for example GetByteArrayAsync or GetStringAsync
Provide guidance to higher level libraries and app developers showing them how they can create single activities that group together multiple portions of the work when they are in position to know it happens contiguously.

I also want to loop in @brianrob who has probably had more experience with this type of issue than I have. He may know of some pre-existing approaches that handle this better without needing to do anything novel. Hope this helps!

samsp-msft · 2020-08-17T22:35:38Z

For 5, I'd recommend we scope down and send the stop at the end of SendAsync, and make it clear either with the event name, or parameters in the payload as to what type of request it is, and so what the Stop will actually correspond to.

samsp-msft · 2020-08-19T17:02:56Z

FYI: For your test app, make sure you aren’t running under the debugger. VS TPL analysis can conflict with the activity tracking – its something Noah mentioned to me when I was looking at some other event stuff. Sam

MihaZupan · 2020-08-31T10:24:03Z

Closing as resolved by #41022

karelz · 2020-09-01T10:27:04Z

Fixed in 6.0/master (PR #41022) and in 5.0 (PR #41610).

Dotnet-GitSync-Bot added area-System.Diagnostics.Tracing untriaged New issue has not been triaged by the area owner labels Aug 13, 2020

ManickaP added the area-System.Net.Http label Aug 13, 2020

karelz added this to the 5.0.0 milestone Aug 13, 2020

karelz added bug and removed untriaged New issue has not been triaged by the area owner labels Aug 13, 2020

tarekgh removed the area-System.Diagnostics.Tracing label Aug 13, 2020

ManickaP self-assigned this Aug 18, 2020

ManickaP mentioned this issue Aug 19, 2020

Moved HTTP request telemetry to HttpClient.SendAsync #41022

Merged

karelz assigned MihaZupan and unassigned ManickaP Aug 27, 2020

MihaZupan closed this as completed Aug 31, 2020

ghost locked as resolved and limited conversation to collaborators Dec 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ActivityId missing in a newly added HttpTelemetry Stop events. #40776

ActivityId missing in a newly added HttpTelemetry Stop events. #40776

ManickaP commented Aug 13, 2020 •

edited by MihaZupan

Loading

ghost commented Aug 13, 2020

ghost commented Aug 13, 2020

ManickaP commented Aug 14, 2020

noahfalk commented Aug 14, 2020

ManickaP commented Aug 14, 2020

MihaZupan commented Aug 14, 2020

ManickaP commented Aug 14, 2020

MihaZupan commented Aug 14, 2020 •

edited

Loading

ManickaP commented Aug 14, 2020

KalleOlaviNiemitalo commented Aug 14, 2020 •

edited

Loading

stephentoub commented Aug 14, 2020

noahfalk commented Aug 14, 2020

ManickaP commented Aug 14, 2020

samsp-msft commented Aug 17, 2020

noahfalk commented Aug 17, 2020 •

edited by karelz

Loading

samsp-msft commented Aug 17, 2020

samsp-msft commented Aug 19, 2020 via email •

edited by ManickaP

Loading

MihaZupan commented Aug 31, 2020

karelz commented Sep 1, 2020 •

edited

Loading

ActivityId missing in a newly added HttpTelemetry Stop events. #40776

ActivityId missing in a newly added HttpTelemetry Stop events. #40776

Comments

ManickaP commented Aug 13, 2020 • edited by MihaZupan Loading

What needs to be done to achieve desired statistics:

ghost commented Aug 13, 2020

ghost commented Aug 13, 2020

ManickaP commented Aug 14, 2020

Details:

Several options how to solve it:

noahfalk commented Aug 14, 2020

ManickaP commented Aug 14, 2020

MihaZupan commented Aug 14, 2020

ManickaP commented Aug 14, 2020

MihaZupan commented Aug 14, 2020 • edited Loading

ManickaP commented Aug 14, 2020

KalleOlaviNiemitalo commented Aug 14, 2020 • edited Loading

stephentoub commented Aug 14, 2020

noahfalk commented Aug 14, 2020

ManickaP commented Aug 14, 2020

samsp-msft commented Aug 17, 2020

noahfalk commented Aug 17, 2020 • edited by karelz Loading

samsp-msft commented Aug 17, 2020

samsp-msft commented Aug 19, 2020 via email • edited by ManickaP Loading

MihaZupan commented Aug 31, 2020

karelz commented Sep 1, 2020 • edited Loading

ManickaP commented Aug 13, 2020 •

edited by MihaZupan

Loading

MihaZupan commented Aug 14, 2020 •

edited

Loading

KalleOlaviNiemitalo commented Aug 14, 2020 •

edited

Loading

noahfalk commented Aug 17, 2020 •

edited by karelz

Loading

samsp-msft commented Aug 19, 2020 via email •

edited by ManickaP

Loading

karelz commented Sep 1, 2020 •

edited

Loading