
Supportability: improve own logs #2102

Closed
1 of 6 tasks
tigrannajaryan opened this issue Nov 10, 2020 · 9 comments
Labels
area:miscellaneous, enhancement, help wanted, priority:p2, release:allowed-for-ga, Stale

Comments

@tigrannajaryan
Member

tigrannajaryan commented Nov 10, 2020

The Collector's own logs are an important source of information for troubleshooting. In some cases the own logs, available locally, are the only information available for troubleshooting. Other sources, such as the Collector's own metrics, require the Collector to be correctly configured to scrape itself and send the metrics to a backend, and require that backend to be available. Even zPages, which are exposed locally by the Collector, may not be available if, for example, the Collector crashes. In such cases logs are the only useful source of troubleshooting information.

To increase the value of the Collector's logs I suggest a few improvements:

  • Make logs more human-readable locally: switch from "json" to "console" encoding even when the "release" logging configuration is selected (but keep the production zap logger configuration for "release" mode). This is still a fairly well-defined format that can be parsed unambiguously on the backend if needed; a minimal configuration sketch follows this list. Use "console" encoding for own logs #2106
  • Periodically output the values of the Collector's own metrics in the log: Supportability: print own metrics in logs #2098
  • Introduce a rate-limited logger that all components can use to log failures during operation. The exporter helper already does this for most exporters that use queued_retry. We will need similar capabilities for receivers, processors and extensions.
  • Ensure all core exporters correctly report errors (and that such errors are visible in the logs) when the destination is unavailable, and when the destination is available but responds with an error that is likely the result of misconfiguration (e.g. HTTP 404 due to an incorrect destination endpoint in the config).

Additional ideas that may be worth doing:

  • When data is not received at the backend it is not clear where it was lost. Ensure all pipeline components clearly expose counters for the data they see, and dump those counters to the logs periodically (e.g. every minute).
  • When a receiver sees zero data for a while after startup (e.g. for 1 minute), log a warning (unless it is the kind of receiver for which it is normal not to see any data for a long time).

Note: we need to be careful to not flood the logs.
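
For illustration, here is a minimal sketch of what the first item could look like with zap. This is not the Collector's actual code; `newCollectorLogger` is a hypothetical helper that keeps zap's production configuration and only swaps the encoding:

```go
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// newCollectorLogger keeps zap's production defaults (info level, sampling,
// structured fields) but switches the encoding from "json" to "console" so
// the output is tab-separated and readable on a terminal.
func newCollectorLogger() (*zap.Logger, error) {
	cfg := zap.NewProductionConfig()
	cfg.Encoding = "console"
	// Human-readable timestamps instead of epoch floats.
	cfg.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
	return cfg.Build()
}

func main() {
	logger, err := newCollectorLogger()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()
	logger.Info("Receiver started.", zap.String("component_kind", "receiver"))
}
```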

@tigrannajaryan added the feature request and help wanted labels Nov 10, 2020
@andrewhsu added the priority:p2 label Nov 11, 2020
@jpkrohling
Member

jpkrohling commented Nov 12, 2020

Can we please discuss this before continuing? I'm not sure the first item in the list is a good move, especially without a heads-up to current users of the collector.

I think moving from structured JSON logs to "mix of tab-separated and JSON" logs is a step back. People who want to have human-readable logs should be able to use the dev profile, whereas the vast majority of people could still use the JSON features.

Before this change, I could simply pipe the console output to jq and extract fields I'm interested in. With this change, I have to use awk and hope that I get the correct fields from the logs.

At the very least, this new (breaking) change should be behind a flag so that people can control which behavior they want.

Before #2106/#2109, here is how the logs looked with --log-profile=prod:

{"level":"info","ts":1605170062.2534976,"caller":"builder/receivers_builder.go:75","msg":"Receiver started.","component_kind":"receiver","component_type":"otlp","component_name":"otlp"}

And here's how it looks now:

2020-11-12T09:34:31.205+0100	info	builder/receivers_builder.go:75	Receiver started.	{"component_kind": "receiver", "component_type": "otlp", "component_name": "otlp"}

@tigrannajaryan
Member Author

@jpkrohling Yes, let's discuss.

I understand what you are saying, but I disagree.

The Collector is a unique piece of infrastructure. It needs to be observable and debuggable even when the observability system is broken. This means we cannot rely only on traditional approaches to observe the Collector. One of the important areas where I think we have to deviate from traditional practice is readability of logs. For virtually any other service, using JSON is highly desirable because it is machine-readable and can be collected into the logging system, where it is searchable and queryable and where most people will be looking at the logs.
With the Collector we cannot rely on that. Instead it is much more important for the logs to be easily consumable by humans who have only a console available to them. Piping JSON logs to jq does not make them sufficiently readable; they are still very difficult to read.

Tab-delimited logs are vastly more readable when all you have is the console.

At the very least, this new (breaking) change should be behind a flag so that people can control which behavior they want.

I agree. I think non-JSON logs should be the default and we can have a command line option to output JSON logs.

We can also add this change to the CHANGELOG to bring more visibility to it.

tigrannajaryan added a commit to tigrannajaryan/opentelemetry-collector that referenced this issue Nov 18, 2020
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-collector that referenced this issue Nov 18, 2020
@tigrannajaryan
Member Author

@jpkrohling see #2177

@jpkrohling
Member

Thanks for addressing my concerns, @tigrannajaryan!

Collector is a unique piece of infrastructure. It needs to be observable and debuggable even when the observability system is broken.

Agree, but I think it ends up depending on how we see and operate the Collector in production. If we have only a few instances (pets), we are likely to look at the logs of individual instances. However, for highly elastic scenarios (cattle), we'd rather have the logs sent elsewhere and post-processed, in which case making them easy for machines to parse is preferable.

Following the same train of thought, #2098 might not bring many benefits.

@pkositsyn
Contributor

My issue was closed as a duplicate, so I'd like to comment here. I have a question about this point:

Introduce a rate-limited logger that all components can use to log failures during operation. Exporter helper already does this for most exporters that use queued_retry. We will need similar capabilities for receivers, processors and extensions.

I think there is a simple solution: add a global rate limit for the Collector's own logs, with a corresponding flag.

Why I think this is actually sufficient for any purpose:

  1. This is a somewhat hard limit for logs. You know that anything beyond this limit won't add information, or means the log delivery pipeline is overloaded, or simply causes performance degradation due to excessive writes to the console.
  2. This limit is something you should never reach. Reaching it means there is a critical error in the code (usually logging every request that meets some condition). Again, this is just a safety measure for the whole pipeline.

Even if you don't like the solution I am offering, the critical point is that we currently cannot limit the total number of log lines. Some existing components log on every request (perhaps only error requests, but still), and it is very hard to get rid of all those places.

@jpkrohling
Member

Some implemented components do logging on every request (maybe error request, but anyway)

I agree with your proposal as a whole, and I think our logger (zap) does support rate limiting. In any case, it would be good to have bug reports against those components. Logging on error is desirable, but we might be able to optimize hot paths...
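
For reference, a minimal sketch of zap's built-in per-message sampling (illustrative only, not the Collector's actual wiring; the 100/100 numbers are arbitrary):

```go
package main

import "go.uber.org/zap"

func main() {
	// zap's production config supports per-second sampling out of the box:
	// within each second, the first `Initial` entries with a given message
	// are kept, then only every `Thereafter`-th one after that.
	cfg := zap.NewProductionConfig()
	cfg.Sampling = &zap.SamplingConfig{Initial: 100, Thereafter: 100}

	logger, err := cfg.Build()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	for i := 0; i < 10000; i++ {
		logger.Warn("request failed") // most of these are dropped by the sampler
	}
}
```

Note that this samples by message content rather than enforcing a global byte or line budget, so it is close to, but not exactly, the hard limit proposed above.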

@pkositsyn
Contributor

Yes, it is actually possible to tune the logger. It's a pity I cannot do it out of the box; it requires changes to the code. Is there a position against exposing more flags?

@jpkrohling
Member

Is there a position on not exposing more flags?

Not that I'm aware of. Each new flag comes with the need to document and maintain it, but nothing prevents us from adding one if it is justified.

MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this issue Nov 11, 2021
* Correct status transform in OTLP exporter

* Add changes to changelog
@NickLarsenNZ

NickLarsenNZ commented Nov 23, 2022

As for formatting, I think logfmt should be an option alongside JSON (and perhaps use the zap log structure). I find it far more readable raw, but still parsable (e.g. within Grafana).
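
For reference, the "Receiver started." record from the examples above might look roughly like this in logfmt (illustrative only; zap's built-in encoders are "json" and "console", so this would need a third-party or custom encoder):

ts=2020-11-12T09:34:31.205+0100 level=info caller=builder/receivers_builder.go:75 msg="Receiver started." component_kind=receiver component_type=otlp component_name=otlp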

@github-actions github-actions bot added the Stale label Nov 24, 2024
@github-actions github-actions bot closed this as not planned Dec 30, 2024