Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing: Support 128bit trace id with zipkin tracer #3386

Merged
merged 9 commits into from
May 22, 2018

Conversation

objectiser
Copy link
Contributor

@objectiser objectiser commented May 15, 2018

Description:
Support 128 bit X-B3-TraceId. Currently if a 128bit id is provided by an external application, it results in a NullSpan so the activity performed in the proxy is not reported to the backend tracing system. New trace instances created by the proxy will now also use a 128bit trace id.

NOTE: If a service mesh includes proxies with a mix of versions, some with and without this change, an older proxy will no longer trace activity. It is the same as if (with the current version) an application outside the mesh provides a trace context using the 128bit trace id.

Risk Level: Low

Testing:
Test added to confirm support for 128bit id. The Span already had support for the high part of the trace id - just was never used - but tests were already included.
Manual testing using the zipkin-tracing and jaeger-tracing examples, supplying no trace id (so generating 128bit id in proxy), and supplying both 64bit and 128bit trace ids to check works with both.

Docs Changes:
Don't believe any documentation changes are required, as the format of the trace id is not discussed (I believe).

Release Notes:
TO BE DONE - once it is decided how best to deal with the impact.

If this change is user impacting OR extension developer impacting (filter API, etc.) you must
add a release note to version_history.rst. Please include
any relevant links. Each release note should be prefixed with the relevant subsystem in
alphabetical order (see existing examples as a guide) and include links to relevant parts of the
documentation. Thank you! Please write in N/A if there are no release notes.

May address issue #3361

Signed-off-by: Gary Brown [email protected]

@objectiser
Copy link
Contributor Author

Was trying to think about best way to handle migrating to the 128bit trace ids - whether a new config option should be added that was disabled until all of the proxies had been updated to the appropriate version and then coordinate the config update.

However in reality the only reason 128bit ids would be used is if:
a) An external client supplies one - which doesn't work anyway
b) The initial ingress gateway/proxy is using the newest version.

So one way to control the rollout of the change is to ensure the ingress proxy is the last to be upgrade.

@objectiser objectiser changed the title WIP: tracing: Support 128bit trace id with zipkin tracer tracing: Support 128bit trace id with zipkin tracer May 15, 2018
@mattklein123
Copy link
Member

@adriancole do you mind taking a quick Zipkin sanity pass on this?

@objectiser
Copy link
Contributor Author

@mattklein123 While waiting for sanity check - would it be worth discussing the implications for rolling this out? As not clear what the best option might be.

@codefromthecrypt
Copy link
Contributor

codefromthecrypt commented May 19, 2018 via email

Copy link
Contributor

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one bug, two adjustments, one nit

otherwise looks good!


/**
* Constructor that creates a context object from the supplied trace, span and
* parent ids, and sampled flag.
*
* @param trace_id The trace id.
* @param trace_id The low 64 bits of the trace id.
* @param trace_id_high The high 64 bits of the trace id.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: usually we put the high bits before the low bits when ordering

ex trace_id_high then trace_id

/**
* @return whether using 128 bit trace id.
*/
bool is128BitTraceId() const { return trace_id_high_ > 0; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean != 0

zero is an invalid set of high bits, negative is fine

@@ -33,6 +33,7 @@ SpanPtr Tracer::startSpan(const Tracing::Config& config, const std::string& span
uint64_t random_number = random_generator_.random();
span_ptr->setId(random_number);
span_ptr->setTraceId(random_number);
span_ptr->setTraceIdHigh(random_generator_.random());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this always uses 128-bit? Probably fine by now, but might break some folks. Usually, we make this optional setting. ex https://github.com/openzipkin/brave/blob/master/brave/src/main/java/brave/Tracing.java#L270

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS this comment is about the implications of rollout. As-is, an old downstream service might break if they aren't prepared to consume 128-bit trace IDs. usually the behavior is to restart the trace, but not everything is consistent. That said, many libraries handle big IDs by now.

openzipkin/b3-propagation#6

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @mattklein123 on this one..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adriancole Originally wasn't sure whether to make it configurable, as seems like 128 is fairly standard now - but there is potential for it causing disruption in a mesh if using different versions of envoy. So I will look to make it configurable, with default of 64 bit - but atleast the newer version will be able to handle a 128bit id even if it doesn't use it for new trace instances by default.

@@ -479,7 +479,10 @@ class Span : public ZipkinBase {
/**
* @return the span's trace id as a hexadecimal string.
*/
const std::string traceIdAsHexString() const { return Hex::uint64ToHex(trace_id_); }
const std::string traceIdAsHexString() const {
return (trace_id_high_.has_value() ? Hex::uint64ToHex(trace_id_high_.value()) : "") +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know cpp, but guessing the conditional would be quicker made in a way that doesn't always cat values?

ex.

return trace_id_high_.has_value()
           ? Hex::uint64ToHex(trace_id_high_.value()) +  Hex::uint64ToHex(trace_id_)
           :  Hex::uint64ToHex(trace_id_);

TEST_F(ZipkinDriverTest, ZipkinSpanContextFromB3Headers128TraceIdTest) {
setupValidDriver();

const std::string trace_id =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd actually test this.. below you aren't really testing this literally. For example, test that the high bits are in the right spot (so the input left hex characters are trace_id_high and right ones are trace_id)

@mattklein123
Copy link
Member

@mattklein123 While waiting for sanity check - would it be worth discussing the implications for rolling this out? As not clear what the best option might be.

What's the concern?

Signed-off-by: Gary Brown <[email protected]>
Copy link
Contributor

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! thanks

@objectiser
Copy link
Contributor Author

@adriancole Thanks for the review - I'll add the configuration option as I think it will avoid any upgrade issues for users.

@@ -55,7 +55,8 @@ Zipkin driver
"type": "zipkin",
"config": {
"collector_cluster": "...",
"collector_endpoint": "..."
"collector_endpoint": "...",
"trace_id_128bit": true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not add any new v1 config options. For v1 configs just hard code it to off during the translation? Do you mind unwinding these changes and the schema changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated - let me know if I have missed anything or misunderstood.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, just some minor comments. Thank you!

/**
* @return the sampled flag.
*/
bool sampled() const { return sampled_; }

private:
uint64_t trace_id_high_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this var be const? Same for the others?

private:
const std::string service_name_;
Network::Address::InstanceConstSharedPtr address_;
ReporterPtr reporter_;
Runtime::RandomGenerator& random_generator_;
bool trace_id_128bit_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't const due to the setter. Would you prefer this was added as a constructor parameter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general that would be clear if it fits in the code. Is it easy to do and get rid of the setter? Not a big deal if it's not possible.

@@ -40,6 +40,7 @@ class ZipkinCoreConstantValues {
const std::string NOT_SAMPLED = "0";

const std::string DEFAULT_COLLECTOR_ENDPOINT = "/api/v1/spans";
const bool DEFAULT_TRACE_ID_128BIT = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed (see below)

@@ -68,10 +68,14 @@ Driver::Driver(const Json::Object& config, Upstream::ClusterManager& cluster_man
const std::string collector_endpoint =
config.getString("collector_endpoint", ZipkinCoreConstants::get().DEFAULT_COLLECTOR_ENDPOINT);

tls_->set([this, collector_endpoint, &random_generator](
const bool trace_id_128bit =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is always going to return false. You can just remove these lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused, as when I test with the jaegertracing example updating the front-proxy yaml to include the 'trace_id_128bit: true', it uses a 128bit trace id - so this must mean that the value in the yaml is being converted to json and accessed via this condition?

If that is not how the v2 config is working, is there some docs that explain how the v2 and v1 configs are being used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From further examination of the code it looks like many components have a factory that can take the v1 json or v2 proto message - however the tracer extension doesn't seem to have been updated to support the v2 proto config directly - so assume it is json conversion from the proto message? If so, then may be better to leave this in for now until proper separation between the v1 and v2 config for the tracer extension.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v2 json isn't exactly the same as v2 proto (for example in proto we use byte arrays for IDs)

https://github.com/openzipkin/zipkin-api/blob/master/zipkin2-api.yaml#L298
https://github.com/openzipkin/zipkin-api/blob/master/zipkin.proto

Note the proto format is quite new, so requires a very recent version of zipkin-server or clone

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adriancole Sorry Adrian, I wasn't clear - I am referring to the envoy v1/v2 APIs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@objectiser hmm. Sorry. I guess we don't have direct proto factory functions for the tracers yet. Sorry, I thought we had already added this. This will need to be added. I will open a follow up issue for this. Yes don't worry about it for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Extract trace id - which can either be 128 or 64 bit. For 128 bit,
// it needs to be divided into two 64 bit numbers (high and low).
if (request_headers.XB3TraceId()->value().size() == 32) {
std::string tid = request_headers.XB3TraceId()->value().c_str();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const for all these local vars

@@ -20,7 +20,8 @@ TEST(ZipkinTracerConfigTest, ZipkinHttpTracer) {
std::string valid_config = R"EOF(
{
"collector_cluster": "fake_cluster",
"collector_endpoint": "/api/v1/spans"
"collector_endpoint": "/api/v1/spans",
"trace_id_128bit": true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert, and make a new v2 style YAML test to exercise the new feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have reverted the change. May need discussion on the type of v2 test, based on the outcome of the v1/v2 config discussion above.

@@ -106,7 +106,8 @@ TEST_F(ZipkinDriverTest, InitializeDriver) {
std::string valid_config = R"EOF(
{
"collector_cluster": "fake_cluster",
"collector_endpoint": "/api/v1/spans"
"collector_endpoint": "/api/v1/spans",
"trace_id_128bit": false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per other comments these will need to switch to v2 YAML tests.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, LMK if you want to change the setter thing or not. Thank you!

@objectiser
Copy link
Contributor Author

I can make the change but will have to do it later or tomorrow as out today.

Signed-off-by: Gary Brown <[email protected]>
@objectiser
Copy link
Contributor Author

@mattklein123 Hopefully should be ok now.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks!

@mattklein123 mattklein123 merged commit f9c0cf8 into envoyproxy:master May 22, 2018
@codefromthecrypt
Copy link
Contributor

Thanks a lot @objectiser and @mattklein123 for the review!

@objectiser objectiser deleted the zipkin128 branch May 28, 2018 11:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants