
log: fatal log is not written inline in the log file #62768

Closed

andreimatei opened this issue Mar 29, 2021 · 4 comments
Labels
A-logging In and around the logging infrastructure. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@andreimatei
Contributor

andreimatei commented Mar 29, 2021

I'm seeing that a log.Fatal only appears in the log file a second after it actually happens. This seems to have something to do with the log message being somehow redirected to the "dev channel"?
I don't quite understand what's going on, but I think it'd be better for the fatal to also appear inline at the point where it happens.
Right now I see this:

I210326 12:02:18.951783 195 1@kv/kvserver/store_raft.go:524 ⋮ [n9,s9,r9/8:‹/Table/1{3-4}›] 485  the server is terminating due to a fatal error (see the DEV channel for details)
...
<more message>
message #618
message #620
<more messages>
F210326 12:02:18.951852 195 kv/kvserver/store_raft.go:524 ⋮ [n9,s9,r9/8:‹/Table/1{3-4}›] 619  ...
<stack trace>

I would like to see (a copy of) the fatal message, rather than the "the server is terminating due to a fatal error (see the DEV channel for details)" message (which I think should not exist), because the point where the fatal happens is important.

Then, I'm a bit confused about how I'm seeing other messages after a fatal (about 0.5s worth of them) - I thought a fatal calls os.Exit(), which I'd expect to be fast. And I'm further confused about why the final message has a counter of 619 - there's probably an explanation, but that counter value doesn't match the position of the "the server is terminating due to a fatal error" message, and it isn't the highest counter in the file either.

9.log

Jira issue: CRDB-2670

@andreimatei andreimatei added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-logging In and around the logging infrastructure. labels Mar 29, 2021
@tbg
Member

tbg commented Apr 12, 2021

// Add a counter. This is important for e.g. the SQL audit logs.
// Note: whether the counter is displayed or not depends on
// the formatter.
editedEntry.counter = atomic.AddUint64(&s.msgCount, 1)

It doesn't look like any locks are held during msg seqno generation, so it makes sense that entries can go out of order in the actual files. If I interpret your snippet above correctly, 619 (the stack trace) got reordered with 620. (The counter is per-sink, btw.)
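
A minimal sketch of that failure mode, as standalone Go rather than the actual pkg/util/log code: the sequence number is taken atomically, but nothing serializes the subsequent rendering and write, so a later counter can reach the file before an earlier one.

package main

import (
	"fmt"
	"os"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var msgCount uint64
	f, _ := os.CreateTemp("", "log")

	slowCtr := atomic.AddUint64(&msgCount, 1) // the "slow" entry gets counter 1 first...
	fastCtr := atomic.AddUint64(&msgCount, 1) // ...and the "fast" entry gets counter 2

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		time.Sleep(50 * time.Millisecond) // slow rendering, e.g. a huge fatal message
		fmt.Fprintf(f, "%d slow entry\n", slowCtr)
	}()
	go func() {
		defer wg.Done()
		fmt.Fprintf(f, "%d fast entry\n", fastCtr)
	}()
	wg.Wait()

	out, _ := os.ReadFile(f.Name())
	fmt.Print(string(out)) // typically shows "2 fast entry" before "1 slow entry"
}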

I think what's happening here is that, in the repro above, rendering the fatal error probably took a long time (fatal errors tend to be large!).
I toyed with

diff --git a/pkg/cli/start.go b/pkg/cli/start.go
index 80718edb50..fa95e24f93 100644
--- a/pkg/cli/start.go
+++ b/pkg/cli/start.go
@@ -59,6 +59,13 @@ import (
 	"google.golang.org/grpc"
 )
 
+type longTime struct{}
+
+func (longTime) String() string {
+	time.Sleep(5 * time.Second)
+	return "yawn"
+}
+
 // startCmd starts a node by initializing the stores and joining
 // the cluster.
 var startCmd = &cobra.Command{
@@ -688,6 +695,8 @@ If problems persist, please see %s.`
 				fmt.Print(msgS.StripMarkers())
 			}
 
+			time.Sleep(5 * time.Second)
+			log.Fatalf(ctx, "exploding at exactly %s but delayed via %s", timeutil.Now(), longTime{})
 			return nil
 		}(); err != nil {
 			errChan <- err

and it behaves like what you describe above. If I take out longTime{} it's snappy.
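
For what it's worth, roughly the same effect can be reproduced with just the standard library (illustrative only; slowArg below is made up, not a CockroachDB type): the log line cannot be written until all of its arguments have been rendered, so a slow fmt.Stringer pushes the entry's position in the output well past the moment Fatalf was called.

package main

import (
	"log"
	"time"
)

// slowArg pretends to be an error that is expensive to render.
type slowArg struct{}

func (slowArg) String() string {
	time.Sleep(2 * time.Second)
	return "yawn"
}

func main() {
	log.Printf("before fatal at %s", time.Now())
	// This line only reaches stderr ~2s later, because rendering slowArg
	// happens as part of formatting the message, before anything is written.
	log.Fatalf("exploding at %s but delayed via %s", time.Now(), slowArg{})
}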

@knz
Contributor

knz commented Apr 12, 2021

> I'm seeing that a log.Fatal only appears in the log file a second after it actually happens.

As Tobias explained, it probably has to do with goroutine scheduling, plus the cost of packaging the fatal error, plus the cost of sending the crash report to Sentry before the process is terminated.
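
A very rough sketch of that sequence, with made-up helper names (reportToSentry and flushToFileSink are placeholders, not CockroachDB functions, and the real order of the reporting and flushing steps may differ): by the time the rendered fatal entry and its stack trace reach the file, other goroutines have had plenty of time to emit more entries.

package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"time"
)

// reportToSentry stands in for the crash-report upload, which involves a
// network round trip.
func reportToSentry(msg string, stack []byte) {
	time.Sleep(500 * time.Millisecond)
}

// flushToFileSink stands in for writing the rendered entry to the log file.
func flushToFileSink(msg string, stack []byte) {
	fmt.Fprintf(os.Stderr, "F %s\n%s", msg, stack)
}

func fatalf(format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...) // 1. render the message (can be slow for large errors)
	stack := debug.Stack()              // 2. capture the stack trace
	reportToSentry(msg, stack)          // 3. ship the crash report before terminating
	flushToFileSink(msg, stack)         // 4. only now does the entry land in the log
	os.Exit(255)                        // 5. terminate the process
}

func main() {
	fatalf("boom: %s", "details")
}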

> This seems to have something to do with the log message being somehow redirected to the "dev channel"?

I don't see evidence that this is the case (if you have it, please point me to it).

> I don't quite understand what's going on, but I think it'd be better for the fatal to also appear inline at the point where it happens.

The serial number indicates the point at which it happens.

> I would like to see (a copy of) the fatal message instead of the "the server is terminating due to a fatal error (see the DEV channel for details)"

As explained above, the fatal message takes a while to package and emit. It's not being forgotten. The precise point can be derived from the sequence number.

> (this other message I think should not exist)

This exists because if there are multiple sinks and the error occurs while writing to one sink, it may only be reported to another sink. We'll want more testing / tutorials to show and tell what this mechanism is about.
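
A loose illustration of that rationale (made-up types and names, not the pkg/util/log implementation): a short notice goes to every sink up front, so the termination is still recorded somewhere even if writing the full fatal entry to one of the sinks fails.

package main

import (
	"errors"
	"fmt"
)

// fileSink stands in for one configured log sink; writes can fail.
type fileSink struct {
	name   string
	broken bool
}

func (s *fileSink) write(msg string) error {
	if s.broken {
		return errors.New("write failed")
	}
	fmt.Printf("[%s] %s\n", s.name, msg)
	return nil
}

func reportFatal(sinks []*fileSink, fullMsg string) {
	// First a short notice to every sink, so the termination is visible
	// somewhere even if the full entry cannot be written below.
	for _, s := range sinks {
		_ = s.write("the server is terminating due to a fatal error (see the DEV channel for details)")
	}
	// Then the full fatal message, which may fail on an individual sink.
	for _, s := range sinks {
		_ = s.write(fullMsg)
	}
}

func main() {
	reportFatal(
		[]*fileSink{{name: "cockroach.log"}, {name: "aux.log", broken: true}},
		"F replica unavailable: ...",
	)
}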

> Then, I'm a bit confused about how I'm seeing other messages after a fatal (about 0.5s worth of them) - I thought a fatal calls os.Exit(), which I'd expect to be fast.

Discussed above.

> And I'm further confused about why the final message has a counter of 619 - there's probably an explanation, but that counter value doesn't match the position of the "the server is terminating due to a fatal error" message, and it isn't the highest counter in the file either.

Discussed above.

@knz
Contributor

knz commented Apr 12, 2021

From the previous exchange, I understand that we should clarify (in user-facing docs, and in the API docs) that logging is now a concurrent package and that no global lock is held, and so the sequence number should be used to "understand" the ordering with which the messages are emitted.
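
As a practical aside, here is a rough way to view the entries of a file like the attached 9.log in emission order rather than file order (a sketch only; it assumes the counter is the first number after the closing bracket of the tags, as in the excerpt above, and it skips continuation lines such as stack traces):

package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

// counterOf extracts the per-sink counter, assuming it is the first field
// after the last "] " on the line.
func counterOf(line string) (int, bool) {
	i := strings.LastIndex(line, "] ")
	if i < 0 {
		return 0, false
	}
	fields := strings.Fields(line[i+2:])
	if len(fields) == 0 {
		return 0, false
	}
	n, err := strconv.Atoi(fields[0])
	return n, err == nil
}

func main() {
	f, err := os.Open("9.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	type entry struct {
		ctr  int
		line string
	}
	var entries []entry
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if n, ok := counterOf(sc.Text()); ok {
			entries = append(entries, entry{ctr: n, line: sc.Text()})
		}
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].ctr < entries[j].ctr })
	for _, e := range entries {
		fmt.Println(e.line)
	}
}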

Is there any other remaining work needed?

@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021
@thtruo thtruo added T-observability-inf and removed T-server-and-security DB Server & Security labels Feb 18, 2022
@knz
Contributor

knz commented Jan 10, 2023

We actually had a bug there which got fixed in a later version.

@knz knz closed this as completed Jan 10, 2023