Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add basic OT tracing for incoming requests #283

Merged
merged 2 commits into from
Dec 1, 2021
Merged

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Nov 25, 2021

Closes: #271

Implements some very basic opentelemetry tracing for incoming requests.

  • 3 spans:
    1. Incoming Request() call, we add peerID, root and extensions to the attributes
    2. When we reach newRequest(), we add requestID to attributes
    3. When we reach the executor and executeTask(), we don't add any more attributes
  • Works fine across the go-routine boundary, passing Span around and joining it with the local context
  • Added some test utilities to collect trace data and do basic analysis on it - mostly we just assert that we have certain traces by name recorded (e.g. for a single request we can assert we have just a single trace: request->newRequest->executeTask). Some tests don't reach executeTask, and in one test we assert that we got the attributes we expect on each of the spans.

I haven't added any errors to the traces yet, there's just so many ways a request can go bad and I'm not sure we want to add all of those? We can RecordError for any of these errors, and where the error is fatal (pretty much always) we SetStatus(error) for the span. That might make for interesting tracing output, but a lot of extra code I suspect. Thoughts on that welcome.

Below is the trace data output as JSON for the 3 spans so far.

Trace data as JSON
  {
    "Name": "request",
    "SpanContext": {
      "TraceID": "a83c2aa04222c7e8026e64757eddc3df",
      "SpanID": "5238ff3bcb4c551e",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "Parent": {
      "TraceID": "00000000000000000000000000000000",
      "SpanID": "0000000000000000",
      "TraceFlags": "00",
      "TraceState": "",
      "Remote": false
    },
    "SpanKind": 1,
    "StartTime": "2021-11-25T13:24:39.860691243+11:00",
    "EndTime": "2021-11-25T13:24:39.860779479+11:00",
    "Attributes": [
      {
        "Key": "peerID",
        "Value": {
          "Type": "STRING",
          "Value": "13xCCMPNr"
        }
      },
      {
        "Key": "root",
        "Value": {
          "Type": "STRING",
          "Value": "bafyreiey6suh5ogt2y6hbip2jid2vyukw2ws2afkqak5bz7unu67tj4ykm"
        }
      },
      {
        "Key": "extensions",
        "Value": {
          "Type": "STRINGSLICE",
          "Value": [
            "AppleSauce/McGee"
          ]
        }
      }
    ],
    "Events": null,
    "Links": null,
    "Status": {
      "Code": "Unset",
      "Description": ""
    },
    "DroppedAttributes": 0,
    "DroppedEvents": 0,
    "DroppedLinks": 0,
    "ChildSpanCount": 1,
    "Resource": [
      {
        "Key": "service.name",
        "Value": {
          "Type": "STRING",
          "Value": "unknown_service:impl.test"
        }
      },
      {
        "Key": "telemetry.sdk.language",
        "Value": {
          "Type": "STRING",
          "Value": "go"
        }
      },
      {
        "Key": "telemetry.sdk.name",
        "Value": {
          "Type": "STRING",
          "Value": "opentelemetry"
        }
      },
      {
        "Key": "telemetry.sdk.version",
        "Value": {
          "Type": "STRING",
          "Value": "1.2.0"
        }
      }
    ],
    "InstrumentationLibrary": {
      "Name": "graphsync",
      "Version": "",
      "SchemaURL": ""
    }
  }
  {
    "Name": "newRequest",
    "SpanContext": {
      "TraceID": "a83c2aa04222c7e8026e64757eddc3df",
      "SpanID": "f346d1e5d1f8fe77",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "Parent": {
      "TraceID": "a83c2aa04222c7e8026e64757eddc3df",
      "SpanID": "5238ff3bcb4c551e",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "SpanKind": 1,
    "StartTime": "2021-11-25T13:24:39.860740055+11:00",
    "EndTime": "2021-11-25T13:24:39.860768889+11:00",
    "Attributes": [
      {
        "Key": "requestID",
        "Value": {
          "Type": "INT64",
          "Value": 0
        }
      }
    ],
    "Events": null,
    "Links": null,
    "Status": {
      "Code": "Unset",
      "Description": ""
    },
    "DroppedAttributes": 0,
    "DroppedEvents": 0,
    "DroppedLinks": 0,
    "ChildSpanCount": 0,
    "Resource": [
      {
        "Key": "service.name",
        "Value": {
          "Type": "STRING",
          "Value": "unknown_service:impl.test"
        }
      },
      {
        "Key": "telemetry.sdk.language",
        "Value": {
          "Type": "STRING",
          "Value": "go"
        }
      },
      {
        "Key": "telemetry.sdk.name",
        "Value": {
          "Type": "STRING",
          "Value": "opentelemetry"
        }
      },
      {
        "Key": "telemetry.sdk.version",
        "Value": {
          "Type": "STRING",
          "Value": "1.2.0"
        }
      }
    ],
    "InstrumentationLibrary": {
      "Name": "graphsync",
      "Version": "",
      "SchemaURL": ""
    }
  }
  {
    "Name": "executeTask",
    "SpanContext": {
      "TraceID": "a83c2aa04222c7e8026e64757eddc3df",
      "SpanID": "5c2bc000ab59c8fc",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "Parent": {
      "TraceID": "a83c2aa04222c7e8026e64757eddc3df",
      "SpanID": "f346d1e5d1f8fe77",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "SpanKind": 1,
    "StartTime": "2021-11-25T13:24:39.860832158+11:00",
    "EndTime": "2021-11-25T13:24:39.861878458+11:00",
    "Attributes": null,
    "Events": null,
    "Links": null,
    "Status": {
      "Code": "Unset",
      "Description": ""
    },
    "DroppedAttributes": 0,
    "DroppedEvents": 0,
    "DroppedLinks": 0,
    "ChildSpanCount": 0,
    "Resource": [
      {
        "Key": "service.name",
        "Value": {
          "Type": "STRING",
          "Value": "unknown_service:impl.test"
        }
      },
      {
        "Key": "telemetry.sdk.language",
        "Value": {
          "Type": "STRING",
          "Value": "go"
        }
      },
      {
        "Key": "telemetry.sdk.name",
        "Value": {
          "Type": "STRING",
          "Value": "opentelemetry"
        }
      },
      {
        "Key": "telemetry.sdk.version",
        "Value": {
          "Type": "STRING",
          "Value": "1.2.0"
        }
      }
    ],
    "InstrumentationLibrary": {
      "Name": "graphsync",
      "Version": "",
      "SchemaURL": ""
    }
  }

@rvagg rvagg requested a review from hannahhoward November 25, 2021 02:41
@rvagg
Copy link
Member Author

rvagg commented Nov 25, 2021

OK, so I'm adding some error paths to this, I can see how this can be done in a fairly minimal way.

But the problem I have now is that some of the tests are not getting through to the executeTask part of the trace with -race in CI where I would expect them to, like TestRoundTripLargeBlocksSlowNetwork which is supposed to do the full thing. I might be missing something about how the requestmanager works and this is reasonable.

@rvagg
Copy link
Member Author

rvagg commented Nov 26, 2021

Implemented a wait condition on the TaskQueue to avoid the timing problems of testing this, see #284. This can also be used to deal with the flaky introduced in the responsemanager recently.

@rvagg
Copy link
Member Author

rvagg commented Nov 26, 2021

Have added some test assertions around collecting errors in here too, all looks pretty reasonable.

I did wonder whether it's worthwhile plumbing in some stats too - like block and byte count, but that's pretty deep in the call stack, deeper than the current spans. So maybe later if a user says that's going to be actually helpful.

@rvagg
Copy link
Member Author

rvagg commented Nov 26, 2021

These traces stop at executeTask and don't cover the releaseRequestTask -> terminateRequest portion. I'm not sure that's necessary, but maybe it is? The question would be what such traces should look like:

request->newRequest->executeTask->terminateRequest - that's the logical call flow, but in the pause/resume case you can have multiple traces branching from newRequest:

  • request(0)->newRequest(0)->executeTask(0)
  • request(0)->newRequest(0)->executeTask(1)
  • request(0)->newRequest(0)->executeTask(2)

edit: perhaps better visualised as:

  • request(0)
    • newRequest(0)
      • executeTask(0)
      • executeTask(1)
      • executeTask(2)

Should the last one be request(0)->newRequest(0)->executeTask(2)->terminateRequest(0)? Or should there be an additional trace: request(0)->newRequest(0)->terminateRequest(0) to associate the termination with the request rather than the call flow.

If we want to hang this off executeTask then I'll need to plumb a span hand-over back through releaseRequestTaskMessage, or add an additional span on inProgressRequestStatus for this purpose. Neither of which is particularly clean, but workable.

Copy link
Collaborator

@hannahhoward hannahhoward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is how we're terminating spans. It doesn't look like it's taking into account go routines.

We should define what we expect these spans to cover, and then work to implement them properly.

I am assuming:
reqest -> call to request all the way to all responses collected?
newRequest -> request popped off queue all the way to end of request
executeTask -> length of execute task.

If I am correct, only executeTask is currently correct.

impl/graphsync.go Outdated Show resolved Hide resolved
requestmanager/server.go Show resolved Hide resolved
@rvagg
Copy link
Member Author

rvagg commented Nov 29, 2021

OK, good points, I'll investigate.
I suspect that for most purposes it doesn't matter, if you look at the trace dump in the OP (hidden in <details>), it's able to piece the hierarchy together with proper parent->child relationships. So looking at traces should at least show the call hierarchy.
Where it will fall down is the timing, and that's more a matter of intent - what do we want to demonstrate with the timing? A call, or something slightly more abstract—a "request"'s lifecycle. Maybe it's a bit of both: a "Request" lives for its entire lifespan, but the individual steps (newRequest, execute etc.) of it are measuring call lifespans.

@rvagg rvagg force-pushed the rvagg/tracing branch 6 times, most recently from 40d0926 to a6c77ee Compare November 30, 2021 04:34
@rvagg
Copy link
Member Author

rvagg commented Nov 30, 2021

I think this is ready for review.

Rearranged how we're managing spans. There's a parent request span that lives for the lifespan of a request as it moves through its phases. It's ended after the request has been cleaned up. The other spans are direct children of this, with no additional hierarchy and they measure individual calls rather than spanning across multiple, separate calls. executeTask does the most work but with a pause/resume you can have multiple of these. The layout is now:

  • request(0) (set known properties about this request)
    • newRequest(0) (set the requestID onto the parent request(0) trace now that we have it)
    • executeTask(0) (we can assume there'll mostly just be one of these)
    • executeTask(1)
    • executeTask(n)
    • terminateRequest(0)

Example spans and their data seen in the details below; note in this one that the first executeTask has an ErrPaused exception recorded in it so you can see why you have >1 executions within a request.

I also implemented the mirror of #287 but for the RequestManager since I needed the executor to block until the request was actually done. As I said over there, I think this will be a good addition to both sides to make sure our async executors have a strict parallelism and don't do funky overlaps at the edges.

`TestPauseResumeResponse` spans
  {
    "Name": "newRequest",
    "SpanContext": {
      "TraceID": "4ec0852efb6fb6c3cdd0a8b081bd1f65",
      "SpanID": "0f81fd9be8d71479",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "Parent": {
      "TraceID": "4ec0852efb6fb6c3cdd0a8b081bd1f65",
      "SpanID": "8553f61a304a42ca",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "SpanKind": 1,
    "StartTime": "2021-11-30T15:12:50.136492228+11:00",
    "EndTime": "2021-11-30T15:12:50.136548264+11:00",
    "Attributes": null,
    "Events": null,
    "Links": null,
    "Status": {
      "Code": "Unset",
      "Description": ""
    },
    "DroppedAttributes": 0,
    "DroppedEvents": 0,
    "DroppedLinks": 0,
    "ChildSpanCount": 0,
    "Resource": [
      {
        "Key": "service.name",
        "Value": {
          "Type": "STRING",
          "Value": "unknown_service:impl.test"
        }
      },
      {
        "Key": "telemetry.sdk.language",
        "Value": {
          "Type": "STRING",
          "Value": "go"
        }
      },
      {
        "Key": "telemetry.sdk.name",
        "Value": {
          "Type": "STRING",
          "Value": "opentelemetry"
        }
      },
      {
        "Key": "telemetry.sdk.version",
        "Value": {
          "Type": "STRING",
          "Value": "1.2.0"
        }
      }
    ],
    "InstrumentationLibrary": {
      "Name": "graphsync",
      "Version": "",
      "SchemaURL": ""
    }
  },
  {
    "Name": "executeTask",
    "SpanContext": {
      "TraceID": "4ec0852efb6fb6c3cdd0a8b081bd1f65",
      "SpanID": "59380d18efc2aa6a",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "Parent": {
      "TraceID": "4ec0852efb6fb6c3cdd0a8b081bd1f65",
      "SpanID": "8553f61a304a42ca",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "SpanKind": 1,
    "StartTime": "2021-11-30T15:12:50.136588429+11:00",
    "EndTime": "2021-11-30T15:12:50.139372173+11:00",
    "Attributes": null,
    "Events": [
      {
        "Name": "exception",
        "Attributes": [
          {
            "Key": "exception.type",
            "Value": {
              "Type": "STRING",
              "Value": "github.com/ipfs/go-graphsync/requestmanager/hooks.ErrPaused"
            }
          },
          {
            "Key": "exception.message",
            "Value": {
              "Type": "STRING",
              "Value": "request has been paused"
            }
          }
        ],
        "DroppedAttributeCount": 0,
        "Time": "2021-11-30T15:12:50.139360731+11:00"
      }
    ],
    "Links": null,
    "Status": {
      "Code": "Unset",
      "Description": ""
    },
    "DroppedAttributes": 0,
    "DroppedEvents": 0,
    "DroppedLinks": 0,
    "ChildSpanCount": 0,
    "Resource": [
      {
        "Key": "service.name",
        "Value": {
          "Type": "STRING",
          "Value": "unknown_service:impl.test"
        }
      },
      {
        "Key": "telemetry.sdk.language",
        "Value": {
          "Type": "STRING",
          "Value": "go"
        }
      },
      {
        "Key": "telemetry.sdk.name",
        "Value": {
          "Type": "STRING",
          "Value": "opentelemetry"
        }
      },
      {
        "Key": "telemetry.sdk.version",
        "Value": {
          "Type": "STRING",
          "Value": "1.2.0"
        }
      }
    ],
    "InstrumentationLibrary": {
      "Name": "graphsync",
      "Version": "",
      "SchemaURL": ""
    }
  },
  {
    "Name": "request",
    "SpanContext": {
      "TraceID": "4ec0852efb6fb6c3cdd0a8b081bd1f65",
      "SpanID": "8553f61a304a42ca",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "Parent": {
      "TraceID": "00000000000000000000000000000000",
      "SpanID": "0000000000000000",
      "TraceFlags": "00",
      "TraceState": "",
      "Remote": false
    },
    "SpanKind": 1,
    "StartTime": "2021-11-30T15:12:50.136446021+11:00",
    "EndTime": "2021-11-30T15:12:50.242104668+11:00",
    "Attributes": [
      {
        "Key": "peerID",
        "Value": {
          "Type": "STRING",
          "Value": "1WRsHVXpAnpLmF"
        }
      },
      {
        "Key": "root",
        "Value": {
          "Type": "STRING",
          "Value": "bafyreih6oouc26gzpf7okbwixooccyxdftbe42x3pwqzjdd4ff3mm37g2u"
        }
      },
      {
        "Key": "extensions",
        "Value": {
          "Type": "STRINGSLICE",
          "Value": [
            "AppleSauce/McGee"
          ]
        }
      },
      {
        "Key": "requestID",
        "Value": {
          "Type": "INT64",
          "Value": 0
        }
      }
    ],
    "Events": null,
    "Links": null,
    "Status": {
      "Code": "Unset",
      "Description": ""
    },
    "DroppedAttributes": 0,
    "DroppedEvents": 0,
    "DroppedLinks": 0,
    "ChildSpanCount": 4,
    "Resource": [
      {
        "Key": "service.name",
        "Value": {
          "Type": "STRING",
          "Value": "unknown_service:impl.test"
        }
      },
      {
        "Key": "telemetry.sdk.language",
        "Value": {
          "Type": "STRING",
          "Value": "go"
        }
      },
      {
        "Key": "telemetry.sdk.name",
        "Value": {
          "Type": "STRING",
          "Value": "opentelemetry"
        }
      },
      {
        "Key": "telemetry.sdk.version",
        "Value": {
          "Type": "STRING",
          "Value": "1.2.0"
        }
      }
    ],
    "InstrumentationLibrary": {
      "Name": "graphsync",
      "Version": "",
      "SchemaURL": ""
    }
  },
  {
    "Name": "terminateRequest",
    "SpanContext": {
      "TraceID": "4ec0852efb6fb6c3cdd0a8b081bd1f65",
      "SpanID": "07e33ccce2569478",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "Parent": {
      "TraceID": "4ec0852efb6fb6c3cdd0a8b081bd1f65",
      "SpanID": "8553f61a304a42ca",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "SpanKind": 1,
    "StartTime": "2021-11-30T15:12:50.242062248+11:00",
    "EndTime": "2021-11-30T15:12:50.242109046+11:00",
    "Attributes": null,
    "Events": null,
    "Links": null,
    "Status": {
      "Code": "Unset",
      "Description": ""
    },
    "DroppedAttributes": 0,
    "DroppedEvents": 0,
    "DroppedLinks": 0,
    "ChildSpanCount": 0,
    "Resource": [
      {
        "Key": "service.name",
        "Value": {
          "Type": "STRING",
          "Value": "unknown_service:impl.test"
        }
      },
      {
        "Key": "telemetry.sdk.language",
        "Value": {
          "Type": "STRING",
          "Value": "go"
        }
      },
      {
        "Key": "telemetry.sdk.name",
        "Value": {
          "Type": "STRING",
          "Value": "opentelemetry"
        }
      },
      {
        "Key": "telemetry.sdk.version",
        "Value": {
          "Type": "STRING",
          "Value": "1.2.0"
        }
      }
    ],
    "InstrumentationLibrary": {
      "Name": "graphsync",
      "Version": "",
      "SchemaURL": ""
    }
  },
  {
    "Name": "executeTask",
    "SpanContext": {
      "TraceID": "4ec0852efb6fb6c3cdd0a8b081bd1f65",
      "SpanID": "6d6240e1091eaf9a",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "Parent": {
      "TraceID": "4ec0852efb6fb6c3cdd0a8b081bd1f65",
      "SpanID": "8553f61a304a42ca",
      "TraceFlags": "01",
      "TraceState": "",
      "Remote": false
    },
    "SpanKind": 1,
    "StartTime": "2021-11-30T15:12:50.240871254+11:00",
    "EndTime": "2021-11-30T15:12:50.242113534+11:00",
    "Attributes": null,
    "Events": null,
    "Links": null,
    "Status": {
      "Code": "Unset",
      "Description": ""
    },
    "DroppedAttributes": 0,
    "DroppedEvents": 0,
    "DroppedLinks": 0,
    "ChildSpanCount": 0,
    "Resource": [
      {
        "Key": "service.name",
        "Value": {
          "Type": "STRING",
          "Value": "unknown_service:impl.test"
        }
      },
      {
        "Key": "telemetry.sdk.language",
        "Value": {
          "Type": "STRING",
          "Value": "go"
        }
      },
      {
        "Key": "telemetry.sdk.name",
        "Value": {
          "Type": "STRING",
          "Value": "opentelemetry"
        }
      },
      {
        "Key": "telemetry.sdk.version",
        "Value": {
          "Type": "STRING",
          "Value": "1.2.0"
        }
      }
    ],
    "InstrumentationLibrary": {
      "Name": "graphsync",
      "Version": "",
      "SchemaURL": ""
    }
  }

Copy link
Collaborator

@hannahhoward hannahhoward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Looking good!

@rvagg rvagg merged commit 6839edd into main Dec 1, 2021
@rvagg rvagg deleted the rvagg/tracing branch December 1, 2021 00:05
marten-seemann pushed a commit that referenced this pull request Mar 2, 2023
* update to context datastores

* Update go-ds-versioning, go mod tidy

Co-authored-by: Aayush Rajasekaran <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OpenTelemetry Tracing Request Manager
2 participants