Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd-dump-logs: Expand to allow diagnosing CRC corrupted problems in WAL log #15043

Merged
merged 8 commits into from
Dec 30, 2022

Conversation

ptabor
Copy link
Contributor

@ptabor ptabor commented Dec 23, 2022

Before this PR diagnosing etcd not running due to crc wal log corruption landed with:

$ go build ./tools/etcd-dump-logs/... && ./etcd-dump-logs ../../Downloads/data
{"level":"warn","msg":"found unexpected non-snap file; skipping","path":"gke-c10ed2ca2a3e4fb8a45f-ffba-8587271_status.txt"}
{"level":"warn","msg":"found unexpected non-snap file; skipping","path":"gke-c10ed2ca2a3e4fb8a45f-ffba-7612155_status.txt"}
Snapshot:
term=32 index=38873895 nodes=[b71f75320dc06a6c] confstate={"voters":[13195394291058371180],"auto_leave":false}
Start dumping log entries from snapshot.
2022/12/23 16:25:28 Failed reading WAL: walpb: crc mismatch

So we just now CRC got corrupted ... somewhere...

With the PR:

$ go build ./tools/etcd-dump-logs/... && ./etcd-dump-logs --raw ../../Downloads/data > dump

2022/12/23 16:27:11 Failed reading WAL: walpb: crc mismatch: expected: 1d9658e9 computed: 37beb207: in file '0000000000000215-0000000002502c07.wal' at position: 37124224
p

In the dump file we see the exact record that got corrupted (index: 38856823):

{"level":"warn","msg":"Reading entry failed with CRC error","error":"walpb: crc mismatch: expected: 1d9658e9 computed: 37beb207: in file '0000000000000215-0000000002502c07.wal' at position: 37124224"}
Entry: Term:32 Index:38856823 Data:"..."
HardState: term:32 vote:13195394291058371180 commit:38856823

We see that there were snapshots post this corruption. Last one: Snapshot: index:38873895 term:32

Entry: Term:32 Index:38882697 Data:"..."

HardState: term:32 vote:13195394291058371180 commit:38882697

So there were 38882697-38856823= 25874 entries written after the corruption.

Ideally we should have a tool (based on the improved in this PR decoder) that allows to trim the WAL log file to contain just the last snapshot and all the follow up records (or just the entries that have 'index>=' consistent_index taken from bbolt.

@ptabor ptabor force-pushed the 20221223-log-diagnostics branch 4 times, most recently from 7e96db0 to 66264e4 Compare December 28, 2022 15:28
}
for _, finfo := range files {
if filepath.Ext(finfo.Name()) != ".wal" {
lg.Warn("Ignoring not .wal file", zap.String("filename", finfo.Name()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lg.Warn("Ignoring not .wal file", zap.String("filename", finfo.Name()))
lg.Warn("Ignoring not .wal file", zap.String("filename", finfo.Name()))
continue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment isn't resolved yet. Did you intentionally include files which do not suffixed with ".wal"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry - missed this. I'm mimicking the logic of the ReadAll that do skips not .wal files.

continue
}
if errors.Is(err, io.EOF) {
lg.Info("EOF: All entries were processed")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that log, zap.Logger and fmt are used to output messages in this PR. Suggest to keep it consistent with the existing behavior: Use fmt to output normal messages and log to output error or warning messages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also some other places which need updated. Please search "lg." in main.go and raw.go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought log & lg is long term the same logging mechanism as we do in etcd.
Migrated to log for now.

@ptabor ptabor force-pushed the 20221223-log-diagnostics branch from 3997adb to fbc43f9 Compare December 29, 2022 08:53
@ptabor ptabor requested review from ahrtr December 29, 2022 08:54
@ptabor ptabor force-pushed the 20221223-log-diagnostics branch from 4d434e4 to 0bd424e Compare December 29, 2022 11:47
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Such that can be used by tools.

Signed-off-by: Piotr Tabor <[email protected]>
This mode allows to look at RAW protos for all entries in WAL logs in the given directory.

Signed-off-by: Piotr Tabor <[email protected]>
@ptabor ptabor force-pushed the 20221223-log-diagnostics branch from 0bd424e to 007858d Compare December 30, 2022 08:22
@ptabor
Copy link
Contributor Author

ptabor commented Dec 30, 2022

Thank you for review.

@ptabor ptabor merged commit 4ae4d9f into etcd-io:main Dec 30, 2022
@ptabor ptabor deleted the 20221223-log-diagnostics branch December 30, 2022 09:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants