Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add healthcheck to standalone #784

Merged
merged 9 commits into from
May 11, 2018

Conversation

eundoosong
Copy link
Contributor

@eundoosong eundoosong commented Apr 23, 2018

healthcheck is added in the same way as collect and query since it is missed in standalone.
Related issue : #775

Signed-off-by: Eundoo Song [email protected]

Signed-off-by: Eundoo Song <[email protected]>
@coveralls
Copy link

coveralls commented Apr 23, 2018

Coverage Status

Coverage remained the same at 100.0% when pulling af74207 on eundoosong:healcheck_standalone into 5eb2922 on jaegertracing:master.

@yurishkuro
Copy link
Member

So this is actually adding two independent health check endpoints, for collector and query.

Could you describe in the PR why this is needed, how it can be used? Our objective is to only add features that are actually useful to people (and documented!).

Thanks.

@eundoosong
Copy link
Contributor Author

I think it would be better for people who want to use health check in the same way as separated one.
Because it is more consistent rather than providing another end point.
Plus, help message already shows each healthcheck flags of collector and query.
Documentation should be as different PR in documentation repo.

Thanks.

@yurishkuro
Copy link
Member

I think it would be better for people who want to use health check in the same way as separated one.

But how would you use it? E.g. our internal deployment system and discovery system (part of routing) only support one health endpoint per service.

I am not suggesting a separate endpoint (we serve metrics from one port only, healthcheck would be on the same port), but I think there should be only one health status for the binary, not two that are potentially conflicting.

@jpkrohling
Copy link
Contributor

I think there should be only one health status for the binary, not two that are potentially conflicting.

+1, from an operator's perspective, it makes more sense to have one endpoint that reports the health of the system in general than one health check endpoint for each component.

@eundoosong
Copy link
Contributor Author

eundoosong commented Apr 24, 2018

we serve metrics from one port only, healthcheck would be on the same port

I don't get this, does this mean healthcheck is using the same port as metrics?

I agree that one endpoint of the health check per service(binary) makes more sense.
Thanks for pointing out.
So, if you agree with adding one healthcheck in standalone, the following should be considered.

  • Add a healthcheck for standalone. (with new port number)
  • Add a flag like "standalone.health-check-http-port"
  • Hide existing collector(--collector.health-check-http-port) and query(--query.health-check-http-port) flags

I think the implementation will not be many changes, but I'm wondering which port number can be used.

@yurishkuro
Copy link
Member

You don't need a new port, all-in-one already opens several HTTP ports, you can just add an endpoint to one of them. E.g. we define ports for the /metrics endpoint here https://www.jaegertracing.io/docs/monitoring/. In all-in-one the /metrics endpoint is served from query service API port (https://github.com/jaegertracing/jaeger/pull/784/files#diff-053312057ab6b9282b572674adc5b7f6R298), so it makes the most sense to me to add the /health endpoint to the same port.

@eundoosong
Copy link
Contributor Author

eundoosong commented Apr 26, 2018

Your suggestion looks good. But since health check is using its own HTTP server(with ServeMux) and the other service(collector, query) is using Gorilla Mux, it's a bit difficult for health check to use other Mux without change in health check. I did implement (eundoosong@f6a8f5d) but IMO it doesn't look better in term of design.

So I decided collector port for health check(port num is 14269) is used for standalone health check port.
Please review.

func(flagSet *flag.FlagSet) {
flagSet.Int(healthCheckHTTPPort, defaultHealthCheckPort,
"The http port for the health check services")
},
Copy link
Contributor Author

@eundoosong eundoosong Apr 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

health-check for standalone is implemented inside main because there is only one flag existed in standalone.


if err := command.Execute(); err != nil {
fmt.Println(err.Error())
os.Exit(1)
}
}

func hideQueryCollectorFlags(command *cobra.Command) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hides query and collector health check flags.

collectorHTTPPort = "collector.http-port"
collectorZipkinHTTPort = "collector.zipkin.http-port"
// CollectorHealthCheckHTTPPort is the flag name for health check
CollectorHealthCheckHTTPPort = "collector.health-check-http-port"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better approach would be to have a single global flag for all binaries, similar to how we do log levels or metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions,

  • Does this mean we need to take that flag out from collector flags and make a global flag?

similar to how we do log levels or metrics.

  • Are you talking about BasicOptions(cmd/builder/builder_options.go)? How about SharedFlags(cmd/flags/..)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 - yes
2 - I meant shared flags, the BasicOption is independent of cli flags

// AddFlags adds flags for SharedFlags
func AddFlags(flagSet *flag.FlagSet) {
flagSet.String(spanStorageType, "", fmt.Sprintf("Deprecated; please use %s environment variable", storage.SpanStorageTypeEnvVar))
flagSet.String(logLevel, "info", "Minimal allowed log Level. For more levels see https://github.com/uber-go/zap")
flagSet.Int(healthCheckHTTPPort, 0, "The http port for the health check service")
Copy link
Contributor Author

@eundoosong eundoosong Apr 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default port is removed since it's difficult to pass in the current interface. So I let NewHealthCheck set it when each component initializes health check.
Do we need to add a default port in help message? Maybe like below..

flagSet.Int(healthCheckHTTPPort, 0, "The http port for the health check service. For default port, see https://https://www.jaegertracing.io/XXXXX")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better of the default port was printed by help. One option to do that is to define it in a global variable, e.g. defaulted to collector port, and then query service main() can override that var at the beginning (before registering flags)

startQuery(qOpts, spanReader, dependencyReader, logger, metricsFactory, mBldr)
startCollector(cOpts, spanWriter, logger, metricsFactory, samplingHandler, hc)
startQuery(qOpts, spanReader, dependencyReader, logger, metricsFactory, mBldr, hc)
hc.Ready()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't necessarily know how to fix it, but there is a race condition here where collector or query background goroutines can fail to start the server and set HC to Unavailable, but this statement will override it.

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside from race issue in all-in-one, lgtm

@ghost ghost assigned yurishkuro Apr 30, 2018
@ghost ghost added the review label Apr 30, 2018
@eundoosong eundoosong force-pushed the healcheck_standalone branch from a0d42ad to 64b9368 Compare May 9, 2018 15:45
logger.Info("Starting jaeger-query HTTP server", zap.Int("port", qOpts.Port))
go collectorServer()
go queryServer()
hc.Ready()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move goroutines to main from each start function, and start them in main then set Ready.
IMO, this will prevent the race issue.

For 100% guarantees, can we move hc.Ready before goroutines like this?

hc.Ready()
go collectorServer()
go queryServer()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think returning functions like this provides any additional guarantees. I think it's fine to leave it as it was, i.e. call startCollector(), it might fail mid-way and crash the program, or it can spawn goroutine for ListenAndServe, which can fail as well and crush the program later. So there will be a possible time interval where HC==Ready, and then a crash, but I don't think it's critical for all-in-one.

Sorry I led you on the wrong path with the last comment. The fact that the goroutines call log.Fatal on error sort of take care of things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted hc.Ready() to be called before collector/query goroutine call hc.Set(healthcheck.Unavailable) in ListenAndServe error.
I think this would prevent overriding Unavailable by hc.Ready() as you mentioned above coment.

Anyway, if the issue in all-in-one is not critical and fatal log on error is enough, I'll revert.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't think it's critical sice even if the override happens the process will exit anyway.

@eundoosong
Copy link
Contributor Author

Reverted.
Ready to be merged, or is there anything else missing?

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@yurishkuro yurishkuro merged commit a4706c0 into jaegertracing:master May 11, 2018
@ghost ghost removed the review label May 11, 2018
mabn pushed a commit to mabn/jaeger that referenced this pull request May 28, 2018
* master: (38 commits)
  Preparing release 1.5.0 (jaegertracing#847)
  Add bounds to memory storage (jaegertracing#845)
  Add metric for debug traces (jaegertracing#796)
  Change metrics naming scheme (jaegertracing#776)
  Bump gocql version (jaegertracing#829)
  Remove ParentSpanID from domain model (jaegertracing#831)
  Make gas run quiet (jaegertracing#838)
  Revert "Make gas run quite"
  Revert "Install gas from install-ci"
  Install gas from install-ci
  Make gas run quite
  Add 'gas' for security problems scanning (jaegertracing#830)
  Add ability to adjust static sampling probabilities per operation (jaegertracing#827)
  Support log-level flag on agent (jaegertracing#828)
  Remove unused function (jaegertracing#822)
  Add healthcheck to standalone (jaegertracing#784)
  Do not use KeyValue fields directly and use KeyValues as decorator only (jaegertracing#810)
  Add ContaAzul to the adopters list (jaegertracing#806)
  Add ISSUE_TEMPLATE and PULL_REQUEST_TEMPLATE (jaegertracing#805)
  Upgrade to  go 1.10 (jaegertracing#792)
  ...

# Conflicts:
#	cmd/agent/app/builder.go
#	cmd/collector/main.go
#	cmd/query/main.go
#	cmd/standalone/main.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants