
log: fatal log is not written inline in the log file #62768

Closed

andreimatei opened this issue Mar 29, 2021 · 4 comments
Labels
A-logging In and around the logging infrastructure. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@andreimatei
Contributor

andreimatei commented Mar 29, 2021

I'm seeing that a log.Fatal only appears in the log file a second after it actually happens. This seems to have something to do with the log message being somehow redirected to the "dev channel"?
I don't quite understand what's going on, but I think it'd be better for the fatal to also appear inline at the point where it happens.
Right now I see this:

I210326 12:02:18.951783 195 1@kv/kvserver/store_raft.go:524 ⋮ [n9,s9,r9/8:‹/Table/1{3-4}›] 485  the server is terminating due to a fatal error (see the DEV channel for details)
...
<more message>
message #618
message #620
<more messages>
F210326 12:02:18.951852 195 kv/kvserver/store_raft.go:524 ⋮ [n9,s9,r9/8:‹/Table/1{3-4}›] 619  ...
<stack trace>

I would like to see (a copy of) the fatal message, rather than the "the server is terminating due to a fatal error (see the DEV channel for details)" message (which I think should not exist), because the point where the fatal happens is important.

Then, I'm a bit confused about how I'm seeing other messages after a fatal (about 0.5s worth of them) - I thought a fatal calls os.Exit(), which I'd expect to be fast. And I'm further confused about why the final message has a counter of 619 - there's probably an explanation, but that counter value doesn't match the position of the "the server is terminating due to a fatal error" message, and it isn't the highest counter in the file either.

9.log

Jira issue: CRDB-2670

@andreimatei andreimatei added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-logging In and around the logging infrastructure. labels Mar 29, 2021
@tbg
Member

tbg commented Apr 12, 2021

// Add a counter. This is important for e.g. the SQL audit logs.
// Note: whether the counter is displayed or not depends on
// the formatter.
editedEntry.counter = atomic.AddUint64(&s.msgCount, 1)

It doesn't look like any locks are held during msg seqno generation, so it makes sense that entries can go out of order in the actual files. If I interpret your snippet above correctly, 619 (the stack trace) got reordered with 620. (The counter is per-sink, btw.)
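
A minimal sketch of that failure mode, as standalone Go rather than the actual pkg/util/log code: the sequence number is taken atomically, but nothing serializes the subsequent rendering and write, so a later counter can reach the file before an earlier one.

package main

import (
	"fmt"
	"os"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var msgCount uint64
	f, _ := os.CreateTemp("", "log")

	slowCtr := atomic.AddUint64(&msgCount, 1) // the "slow" entry gets counter 1 first...
	fastCtr := atomic.AddUint64(&msgCount, 1) // ...and the "fast" entry gets counter 2

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		time.Sleep(50 * time.Millisecond) // slow rendering, e.g. a huge fatal message
		fmt.Fprintf(f, "%d slow entry\n", slowCtr)
	}()
	go func() {
		defer wg.Done()
		fmt.Fprintf(f, "%d fast entry\n", fastCtr)
	}()
	wg.Wait()

	out, _ := os.ReadFile(f.Name())
	fmt.Print(string(out)) // typically shows "2 fast entry" before "1 slow entry"
}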

I think what's happening here is that, in the repro above, rendering the fatal error probably took a long time (fatal errors tend to be large!).
I toyed with

diff --git a/pkg/cli/start.go b/pkg/cli/start.go
index 80718edb50..fa95e24f93 100644
--- a/pkg/cli/start.go
+++ b/pkg/cli/start.go
@@ -59,6 +59,13 @@ import (
 	"google.golang.org/grpc"
 )
 
+type longTime struct{}
+
+func (longTime) String() string {
+	time.Sleep(5 * time.Second)
+	return "yawn"
+}
+
 // startCmd starts a node by initializing the stores and joining
 // the cluster.
 var startCmd = &cobra.Command{
@@ -688,6 +695,8 @@ If problems persist, please see %s.`
 				fmt.Print(msgS.StripMarkers())
 			}
 
+			time.Sleep(5 * time.Second)
+			log.Fatalf(ctx, "exploding at exactly %s but delayed via %s", timeutil.Now(), longTime{})
 			return nil
 		}(); err != nil {
 			errChan <- err

and it behaves like what you describe above. If I take out longTime{} it's snappy.
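
For what it's worth, roughly the same effect can be reproduced with just the standard library (illustrative only; slowArg below is made up, not a CockroachDB type): the log line cannot be written until all of its arguments have been rendered, so a slow fmt.Stringer pushes the entry's position in the output well past the moment Fatalf was called.

package main

import (
	"log"
	"time"
)

// slowArg pretends to be an error that is expensive to render.
type slowArg struct{}

func (slowArg) String() string {
	time.Sleep(2 * time.Second)
	return "yawn"
}

func main() {
	log.Printf("before fatal at %s", time.Now())
	// This line only reaches stderr ~2s later, because rendering slowArg
	// happens as part of formatting the message, before anything is written.
	log.Fatalf("exploding at %s but delayed via %s", time.Now(), slowArg{})
}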

@knz
Contributor

knz commented Apr 12, 2021

> I'm seeing that a log.Fatal only appears in the log file a second after it actually happens.

As Tobias explained, it probably has to do with goroutine scheduling, plus the cost of packaging the fatal error, plus the cost of sending the crash report to Sentry before the process is terminated.
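
A very rough sketch of that sequence, with made-up helper names (reportToSentry and flushToFileSink are placeholders, not CockroachDB functions, and the real order of the reporting and flushing steps may differ): by the time the rendered fatal entry and its stack trace reach the file, other goroutines have had plenty of time to emit more entries.

package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"time"
)

// reportToSentry stands in for the crash-report upload, which involves a
// network round trip.
func reportToSentry(msg string, stack []byte) {
	time.Sleep(500 * time.Millisecond)
}

// flushToFileSink stands in for writing the rendered entry to the log file.
func flushToFileSink(msg string, stack []byte) {
	fmt.Fprintf(os.Stderr, "F %s\n%s", msg, stack)
}

func fatalf(format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...) // 1. render the message (can be slow for large errors)
	stack := debug.Stack()              // 2. capture the stack trace
	reportToSentry(msg, stack)          // 3. ship the crash report before terminating
	flushToFileSink(msg, stack)         // 4. only now does the entry land in the log
	os.Exit(255)                        // 5. terminate the process
}

func main() {
	fatalf("boom: %s", "details")
}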

> This seems to have something to do with the log message being somehow redirected to the "dev channel"?

I don't see evidence that this is the case (if you have it, please point me to it).

> I don't quite understand what's going on, but I think it'd be better for the fatal to also appear inline at the point where it happens.

The serial number indicates the point at which it happens.

> I would like to see (a copy of) the fatal message instead of the "the server is terminating due to a fatal error (see the DEV channel for details)"

As explained above, the fatal message takes a while to package and emit. It's not being forgotten. The precise point can be derived from the sequence number.

> (this other message I think should not exist)

This exists because if there are multiple sinks and the error occurs while writing to one sink, it may only be reported to another sink. We'll want more testing / tutorials to show and tell what this mechanism is about.
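
A loose illustration of that rationale (made-up types and names, not the pkg/util/log implementation): a short notice goes to every sink up front, so the termination is still recorded somewhere even if writing the full fatal entry to one of the sinks fails.

package main

import (
	"errors"
	"fmt"
)

// fileSink stands in for one configured log sink; writes can fail.
type fileSink struct {
	name   string
	broken bool
}

func (s *fileSink) write(msg string) error {
	if s.broken {
		return errors.New("write failed")
	}
	fmt.Printf("[%s] %s\n", s.name, msg)
	return nil
}

func reportFatal(sinks []*fileSink, fullMsg string) {
	// First a short notice to every sink, so the termination is visible
	// somewhere even if the full entry cannot be written below.
	for _, s := range sinks {
		_ = s.write("the server is terminating due to a fatal error (see the DEV channel for details)")
	}
	// Then the full fatal message, which may fail on an individual sink.
	for _, s := range sinks {
		_ = s.write(fullMsg)
	}
}

func main() {
	reportFatal(
		[]*fileSink{{name: "cockroach.log"}, {name: "aux.log", broken: true}},
		"F replica unavailable: ...",
	)
}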

> Then, I'm a bit confused about how I'm seeing other messages after a fatal (about 0.5s worth of them) - I thought a fatal calls os.Exit(), which I'd expect to be fast.

Discussed above.

> And I'm further confused about why the final message has a counter of 619 - there's probably an explanation, but that counter value doesn't match the position of the "the server is terminating due to a fatal error" message, and it isn't the highest counter in the file either.

Discussed above.

@knz
Contributor

knz commented Apr 12, 2021

From the previous exchange, I understand that we should clarify (in user-facing docs, and in the API docs) that logging is now a concurrent package and that no global lock is held, and so the sequence number should be used to "understand" the ordering with which the messages are emitted.
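
As a practical aside, here is a rough way to view the entries of a file like the attached 9.log in emission order rather than file order (a sketch only; it assumes the counter is the first number after the closing bracket of the tags, as in the excerpt above, and it skips continuation lines such as stack traces):

package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

// counterOf extracts the per-sink counter, assuming it is the first field
// after the last "] " on the line.
func counterOf(line string) (int, bool) {
	i := strings.LastIndex(line, "] ")
	if i < 0 {
		return 0, false
	}
	fields := strings.Fields(line[i+2:])
	if len(fields) == 0 {
		return 0, false
	}
	n, err := strconv.Atoi(fields[0])
	return n, err == nil
}

func main() {
	f, err := os.Open("9.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	type entry struct {
		ctr  int
		line string
	}
	var entries []entry
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if n, ok := counterOf(sc.Text()); ok {
			entries = append(entries, entry{ctr: n, line: sc.Text()})
		}
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].ctr < entries[j].ctr })
	for _, e := range entries {
		fmt.Println(e.line)
	}
}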

Is there any other remaining work needed?

@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021
@thtruo thtruo added T-observability-inf and removed T-server-and-security DB Server & Security labels Feb 18, 2022
@knz
Contributor

knz commented Jan 10, 2023

We actually had a bug there which got fixed in a later version.

@knz knz closed this as completed Jan 10, 2023