
Supportability: improve own logs #2102

Closed
1 of 6 tasks
tigrannajaryan opened this issue Nov 10, 2020 · 9 comments
Labels
area:miscellaneous, enhancement, help wanted, priority:p2, release:allowed-for-ga, Stale

Comments

@tigrannajaryan
Member

tigrannajaryan commented Nov 10, 2020

The Collector's own logs are an important source of information for troubleshooting. In some cases the own logs, available locally, are the only information available for troubleshooting. Other sources, such as the Collector's own metrics, require the Collector to be correctly configured to scrape itself and send the metrics to a backend, and require that backend to be available. Even zPages, which are exposed locally by the Collector, may not be available if, for example, the Collector crashes. In such cases logs are the only useful source of troubleshooting information.

To increase the value of the Collector's logs I suggest a few improvements:

  • Make logs more human-readable locally: switch from "json" to "console" encoding even when the "release" logging configuration is selected (but keep the production zap logger configuration for "release" mode). This is still a fairly well-defined format that can be parsed unambiguously on the backend if needed; a minimal configuration sketch follows this list. Use "console" encoding for own logs #2106
  • Periodically output the values of the Collector's own metrics in the log: Supportability: print own metrics in logs #2098
  • Introduce a rate-limited logger that all components can use to log failures during operation. The exporter helper already does this for most exporters that use queued_retry. We will need similar capabilities for receivers, processors and extensions.
  • Ensure all core exporters correctly report errors (and that such errors are visible in the logs) when the destination is unavailable, and when the destination is available but responds with an error that is likely the result of misconfiguration (e.g. HTTP 404 due to an incorrect destination endpoint in the config).

Additional ideas that may be worth doing:

  • When data is not received at the backend it is not clear where it was lost. Ensure all pipeline components clearly expose counters for the data they see, and dump those counters to the logs periodically (e.g. every minute).
  • When a receiver sees zero data for a while after startup (e.g. for 1 minute), log a warning (unless it is the kind of receiver for which it is normal not to see any data for a long time).

Note: we need to be careful to not flood the logs.
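
For illustration, here is a minimal sketch of what the first item could look like with zap. This is not the Collector's actual code; `newCollectorLogger` is a hypothetical helper that keeps zap's production configuration and only swaps the encoding:

```go
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// newCollectorLogger keeps zap's production defaults (info level, sampling,
// structured fields) but switches the encoding from "json" to "console" so
// the output is tab-separated and readable on a terminal.
func newCollectorLogger() (*zap.Logger, error) {
	cfg := zap.NewProductionConfig()
	cfg.Encoding = "console"
	// Human-readable timestamps instead of epoch floats.
	cfg.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
	return cfg.Build()
}

func main() {
	logger, err := newCollectorLogger()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()
	logger.Info("Receiver started.", zap.String("component_kind", "receiver"))
}
```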

@tigrannajaryan added the feature request and help wanted labels Nov 10, 2020
@andrewhsu added the priority:p2 label Nov 11, 2020
@jpkrohling
Member

jpkrohling commented Nov 12, 2020

Can we please discuss this before continuing? I'm not sure the first item in the list is a good move, especially without a heads-up to current users of the collector.

I think moving from structured JSON logs to "mix of tab-separated and JSON" logs is a step back. People who want to have human-readable logs should be able to use the dev profile, whereas the vast majority of people could still use the JSON features.

Before this change, I could simply pipe the console output to jq and extract fields I'm interested in. With this change, I have to use awk and hope that I get the correct fields from the logs.

At the very least, this new (breaking) change should be behind a flag so that people can control which behavior they want.

Before #2106/#2109, here is how the logs looked with --log-profile=prod:

{"level":"info","ts":1605170062.2534976,"caller":"builder/receivers_builder.go:75","msg":"Receiver started.","component_kind":"receiver","component_type":"otlp","component_name":"otlp"}

And here's how it looks now:

2020-11-12T09:34:31.205+0100	info	builder/receivers_builder.go:75	Receiver started.	{"component_kind": "receiver", "component_type": "otlp", "component_name": "otlp"}

@tigrannajaryan
Member Author

@jpkrohling Yes, let's discuss.

I understand what you are saying, but I disagree.

The Collector is a unique piece of infrastructure. It needs to be observable and debuggable even when the observability system is broken. This means we cannot rely only on traditional approaches to observe the Collector. One of the important areas where I think we have to deviate from traditional practice is readability of logs. For virtually any other service, using JSON is highly desirable because it is machine-readable and can be collected into the logging system, where it is searchable and queryable and where most people will be looking at the logs.
With the Collector we cannot rely on that. Instead it is much more important for the logs to be easily consumable by humans who have only a console available to them. Piping JSON logs to jq does not make them sufficiently readable; they are still very difficult to read.

Tab-delimited logs are vastly more readable when all you have is the console.

At the very least, this new (breaking) change should be behind a flag so that people can control which behavior they want.

I agree. I think non-JSON logs should be the default and we can have a command line option to output JSON logs.

We can also add this change to the CHANGELOG to bring more visibility to it.

tigrannajaryan added a commit to tigrannajaryan/opentelemetry-collector that referenced this issue Nov 18, 2020
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-collector that referenced this issue Nov 18, 2020
@tigrannajaryan
Member Author

@jpkrohling see #2177

@jpkrohling
Member

Thanks for addressing my concerns, @tigrannajaryan!

Collector is a unique piece of infrastructure. It needs to be observable and debuggable even when the observability system is broken.

Agree, but I think it ends up depending on how we see and operate the Collector in production. If we have only a few instances (pets), we are likely to look at the logs of individual instances. However, for highly elastic scenarios (cattle), we'd rather have the logs sent elsewhere and post-processed, in which case making them easy for machines to parse is preferable.

Following the same train of thought, #2098 might not bring many benefits.

@pkositsyn
Contributor

My issue was closed as a duplicate, so I'd like to comment here. I have a question about this point:

Introduce a rate-limited logger that all components can use to log failures during operation. Exporter helper already does this for most exporters that use queued_retry. We will need similar capabilities for receivers, processors and extensions.

I think there is a simple solution: add a global rate limit for the Collector's own logs, with a corresponding flag.

Why I think this is actually sufficient for any purpose:

  1. This is a somewhat hard limit for logs. You know that anything beyond this limit won't add information, or means the log delivery pipeline is overloaded, or simply causes performance degradation due to excessive writes to the console.
  2. This limit is something you should never reach. Reaching it means there is a critical error in the code (usually logging every request that meets some condition). Again, this is just a safety measure for the whole pipeline.

Even if you don't like the solution I am offering, the critical point is that we currently cannot limit the total number of log lines. Some existing components log on every request (perhaps only error requests, but still), and it is very hard to get rid of all those places.

@jpkrohling
Member

Some implemented components do logging on every request (maybe error request, but anyway)

I agree with your proposal as a whole, and I think our logger (zap) does support rate limiting. In any case, it would be good to have bug reports against those components. Logging on error is desirable, but we might be able to optimize hot paths...
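
For reference, a minimal sketch of zap's built-in per-message sampling (illustrative only, not the Collector's actual wiring; the 100/100 numbers are arbitrary):

```go
package main

import "go.uber.org/zap"

func main() {
	// zap's production config supports per-second sampling out of the box:
	// within each second, the first `Initial` entries with a given message
	// are kept, then only every `Thereafter`-th one after that.
	cfg := zap.NewProductionConfig()
	cfg.Sampling = &zap.SamplingConfig{Initial: 100, Thereafter: 100}

	logger, err := cfg.Build()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	for i := 0; i < 10000; i++ {
		logger.Warn("request failed") // most of these are dropped by the sampler
	}
}
```

Note that this samples by message content rather than enforcing a global byte or line budget, so it is close to, but not exactly, the hard limit proposed above.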

@pkositsyn
Contributor

Yes, it is actually possible to tune the logger. It's a pity I cannot do it out of the box; it requires changes to the code. Is there a position against exposing more flags?

@jpkrohling
Member

Is there a position on not exposing more flags?

Not that I'm aware of. Each new flag comes with the need to document and maintain it, but nothing prevents us from adding one if it is justified.

MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021
* Correct status transform in OTLP exporter

* Add changes to changelog
@NickLarsenNZ

NickLarsenNZ commented Nov 23, 2022

As for formatting, I think logfmt should be an option alongside JSON (and perhaps use the zap log structure). I find it far more readable raw, but still parsable (e.g. within Grafana).
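
For reference, the "Receiver started." record from the examples above might look roughly like this in logfmt (illustrative only; zap's built-in encoders are "json" and "console", so this would need a third-party or custom encoder):

ts=2020-11-12T09:34:31.205+0100 level=info caller=builder/receivers_builder.go:75 msg="Receiver started." component_kind=receiver component_type=otlp component_name=otlp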

@github-actions github-actions bot added the Stale label Nov 24, 2024
@github-actions github-actions bot closed this as not planned Dec 30, 2024