Clock skew adjuster and spans from web/mobile #722

Closed
mabn opened this issue Mar 1, 2018 · 4 comments

@mabn

mabn commented Mar 1, 2018

Spans captured on web/mobile clients do not have accurate start times, because those devices' clocks are not synchronized with the server clocks.
Because of this, I use the moment the span is received by the backend as its start timestamp.
It's still wrong because:

  1. spans are received after they finish, so the start timestamp is later than it should be by at least span duration + network latency (assuming synchronized clocks between browser and backend)
  2. mobile devices can be offline, then come back online and sync days later
  3. such spans are likely to be batched and flushed periodically (with even more delay) to avoid many small HTTP requests, which drain batteries

The problem is that the clock skew adjuster "centers" everything if the web span's timing is too far off:
[screenshot: trace with the web span reported as the root]

This is how it looks if the web span is not reported as a root. The positions more or less reflect reality:
[screenshot: trace with the web span not reported as the root]

Am I doing something wrong?
How should I approach reporting mobile/web spans?
Does the clock skew adjuster need some changes to support web/mobile spans?

I'm using version 1.2.0

@yurishkuro
Member

There may be a bug in the clock skew adjustment algorithm. While it does center a child span if the parent span's timing is way off, it's not supposed to center everything, i.e. if most server-side spans have reasonable relative timestamps, their relative positions should be sensible.

It would be helpful if you could post a JSON of the trace that gets centered incorrectly. Unfortunately, we don't have an obfuscation utility, so if you obfuscate manually please make sure to keep the process tags consistent. And you can strip all tags and logs aside from span.kind tags.

@mabn
Author

mabn commented Mar 9, 2018

Here it is: https://gist.github.com/mabn/ec35eb5d6f4a9789b46fd1373c601352
It looks like this (web is the root span):
[screenshot: the trace from the gist, with web as the root span]

  • we don't have span.kind tags; not sure if that matters
  • processes do not have an "ip" tag, so in clockskew.go hostKey defaults to "" and all spans are treated as coming from separate hosts (which is ok)

@mabn
Author

mabn commented Mar 9, 2018

The issue is in calculateSkew. For example, consider this test case:

{
	description: "root starts after all descendants",
	trace: []spanProto{
		{id: 1, parent: 0, startTime: 10, duration: 100, host: "a", adjusted: 10},
		// latency = (100-50) / 2 = 25
		// delta = (10 - 0) + latency = 35
		{id: 2, parent: 1, startTime: 0, duration: 50, host: "b", adjusted: 35,
			logs: []int{5, 10}, adjustedLogs: []int{40, 45}},
		// child fits inside parent - no additional adjustment
		// adjusted = startTime + parentSkew (35) = 45 - but current logic makes it 55
		{id: 3, parent: 2, startTime: 10, duration: 10, host: "c", adjusted: 45},
	},
},

Initially span 3 fits under its parent - span 2 starts at 0, span 3 starts later at 10. Adjustment starts:

  1. span 1 is not adjusted - starts at 10
  2. span 2 starts before its parent, so it's adjusted (centered) under the parent - it now starts at 35
  3. span 3 no longer fits under its parent - it starts at 10, but its parent (span 2) now starts at 35 - so it is adjusted as well and "centered" under span 2
  4. if there are more spans in the subtree, the same thing repeats for each of them, as long as the root span started sufficiently far ahead in time

I'm not entirely sure what the expected behaviour is.

  • If the root span comes from the web with a delay, I'd like to somehow alter only that span, e.g. mark it as having an "unreliable clock" and just adjust its start time to match the rest of the trace. An algorithm that descends from the roots might not work here.

  • If all spans come from a single datacenter with more or less reliable clocks and one is "too far ahead", I'm not sure. It would be good to somehow detect that one clock is completely off.

    One approach that would solve this particular case is to assume that if a child initially fits under its parent and the parent has to be adjusted, then the child is adjusted by the same amount.
    This would basically shift portions of the trace without altering the relations between spans.
    But it doesn't solve the web case properly: the structure (relative timing) would be as expected, but the start times would be too far ahead. E.g. if the root span is reported 30 sec later, then the whole trace will appear as if it happened 30 sec later.

@jkowall
Contributor

jkowall commented Jun 5, 2024

This is something Otel is dealing with currently (relevant discussion: open-telemetry/oteps#154). There are similar discussions in the OpenTelemetry js repo as well, since it's a bigger issue on the web side. There is no standard for handling this type of skew, and it would normally be defined in the instrumentation; Jaeger is merely the visualizer.

@jkowall jkowall closed this as completed Jun 5, 2024