etcd-dump-logs: Expand to allow diagnosing CRC corrupted problems in WAL log #15043
tools/etcd-dump-logs/raw.go
Outdated
}
for _, finfo := range files {
	if filepath.Ext(finfo.Name()) != ".wal" {
		lg.Warn("Ignoring not .wal file", zap.String("filename", finfo.Name()))
- lg.Warn("Ignoring not .wal file", zap.String("filename", finfo.Name()))
+ lg.Warn("Ignoring not .wal file", zap.String("filename", finfo.Name()))
+ continue
This comment isn't resolved yet. Did you intentionally include files that are not suffixed with ".wal"?
I'm sorry, I missed this. I'm mimicking the logic of ReadAll, which skips non-.wal files.
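For illustration, the skip behavior discussed in this thread can be sketched as a small standalone filter. This is only a sketch: `walFileNames` is a hypothetical helper that works on plain file names and is not code from etcd or from this PR.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// walFileNames returns only the names that carry a ".wal" extension,
// preserving input order. Hypothetical helper, for illustration only.
func walFileNames(names []string) []string {
	var out []string
	for _, name := range names {
		if filepath.Ext(name) != ".wal" {
			// Without this continue, a non-.wal file would be warned about
			// and then processed anyway.
			continue
		}
		out = append(out, name)
	}
	return out
}

func main() {
	names := []string{"0000000000000000-0000000000000000.wal", "0.tmp", "notes.txt"}
	fmt.Println(walFileNames(names))
}
```

The point of the suggested `continue` is visible here: the warning alone does not stop the loop body from handling the file.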
tools/etcd-dump-logs/raw.go
Outdated
	continue
}
if errors.Is(err, io.EOF) {
	lg.Info("EOF: All entries were processed")
I see that log, zap.Logger and fmt are all used to output messages in this PR. I suggest keeping it consistent with the existing behavior: use fmt to output normal messages and log to output error or warning messages.
Fixed.
There are also some other places that need to be updated. Please search for "lg." in main.go and raw.go.
I thought log and lg were, long term, the same logging mechanism as the one we use in etcd.
Migrated to log for now.
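The convention agreed on in this thread (fmt for normal output, log for warnings and errors) might look roughly like the sketch below. The helper `summarize` and the message texts are assumptions made for the example, not the code that landed in the PR.

```go
package main

import (
	"fmt"
	"log"
)

// summarize builds the messages to report after reading a WAL directory.
// Hypothetical helper: it returns one normal message and zero or more warnings.
func summarize(processed int, skipped []string) (normal string, warnings []string) {
	for _, name := range skipped {
		warnings = append(warnings, fmt.Sprintf("ignoring non-.wal file: %s", name))
	}
	normal = fmt.Sprintf("EOF: all %d entries were processed", processed)
	return normal, warnings
}

func main() {
	normal, warnings := summarize(3, []string{"0.tmp"})
	for _, w := range warnings {
		// Warnings and errors go through the standard log package,
		// which writes to stderr by default.
		log.Println("WARN:", w)
	}
	// Normal output goes through fmt to stdout.
	fmt.Println(normal)
}
```

Keeping the two streams separate means the dump written to stdout can be redirected to a file without diagnostics mixed into it.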
LGTM
Signed-off-by: Piotr Tabor <[email protected]>
Such that can be used by tools. Signed-off-by: Piotr Tabor <[email protected]>
This mode allows looking at the RAW protos for all entries in the WAL logs in the given directory. Signed-off-by: Piotr Tabor <[email protected]>
Thank you for the review.
Before this PR, diagnosing an etcd that does not start due to CRC corruption in the WAL log ended with:
So we just know that the CRC got corrupted... somewhere.
With the PR:
In the dump file we see the exact record that got corrupted (index: 38856823):
We see that there were snapshots post this corruption. Last one:
Snapshot: index:38873895 term:32
So there were 38882697 - 38856823 = 25874 entries written after the corruption.
Ideally we should have a tool (based on the decoder improved in this PR) that allows trimming the WAL log file so that it contains just the last snapshot and all the follow-up records (or just the entries that have index >= consistent_index, taken from bbolt).
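A minimal sketch of the selection logic such a trimming tool would need, under heavy assumptions: `record` is a hypothetical, simplified view of a decoded WAL record, and both helpers are invented for illustration. A real tool would also have to deal with the other WAL record types and rewrite the CRC chain, none of which is shown here.

```go
package main

import "fmt"

// record is a hypothetical, simplified view of one decoded WAL record.
type record struct {
	isSnapshot bool
	index      uint64
}

// keepFromLastSnapshot returns the records starting at the last snapshot.
// If no snapshot is present, everything is kept.
func keepFromLastSnapshot(recs []record) []record {
	for i := len(recs) - 1; i >= 0; i-- {
		if recs[i].isSnapshot {
			return recs[i:]
		}
	}
	return recs
}

// keepFromIndex returns the non-snapshot entries with index >= minIndex,
// where minIndex would be the consistent_index read from bbolt.
func keepFromIndex(recs []record, minIndex uint64) []record {
	var out []record
	for _, r := range recs {
		if !r.isSnapshot && r.index >= minIndex {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// Indices taken from the comment above; the record in between is made up.
	recs := []record{
		{index: 38856823},                   // the corrupted record
		{isSnapshot: true, index: 38873895}, // the last snapshot
		{index: 38873896},                   // illustrative follow-up entry
		{index: 38882697},
	}
	fmt.Println(len(keepFromLastSnapshot(recs)))
	fmt.Println(len(keepFromIndex(recs, 38873896)))
}
```

Either selection drops the corrupted record at index 38856823, since it precedes both the last snapshot and any plausible consistent_index.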