-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cli: add debug doctor command #52622
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 6 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/sql/sqlbase/descriptor.go, line 221 at r1 (raw file):
// CreateAsOfTime that it is now set. if t.Table.Adding() && t.Table.IsAs() && t.Table.CreateAsOfTime.IsEmpty() { log.Fatalf(context.TODO(), "table descriptor for %q (%d.%d) is in the "+
I wonder if it would make sense to error out with an assertion error instead of stopping the server in this condition.
pkg/sql/sqlbase/descriptor.go, line 238 at r1 (raw file):
// are rendered in some other downstream setting we expect the timestamp to // be read. This is a hack we shouldn't need to do. log.Errorf(context.TODO(), "read table descriptor for %q (%d) without ModificationTime "+
I think the behavior (error vs fatal) should be dependent on a boolean. In a running server, this error should ensure the caller does not get a chance to use the descriptor. See my comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @knz, and @spaskob)
pkg/sql/sqlbase/descriptor.go, line 221 at r1 (raw file):
Previously, knz (kena) wrote…
I wonder if it would make sense to error out with an assertion error instead of stopping the server in this condition.
@ajwerner suggested an approach to read the ts from crdb
_intenal and for debug.zip setting any non-zero ts makes the error go away. We could look into dumping the timestamp in debug.zip as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few general thoughts:
- you don't need to invest too much into polish since this is a debug command.
- however, you probably do want some tests. So I recommend some tests that injects invalid descriptors and verifies that the function finds the errors
- the commit message (and PR description) should be scoped down to what this PR actually does. It's OK to talk about "future plans" in an issue (which is not filed yet as far as I can see, so you should file it and refer to it from the PR). It's not OK to talk about future plans in the commit message. The commit message should describe objectively what's contained in the commit.
- the release note is IMHO premature since the work is not complete yet. If you do keep it, you need to start the release note with a description of what tool was introduced. Remember, the release note is pasted into docs without any reference to the PR or any title.
Reviewed 6 of 6 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/sql/sqlbase/descriptor.go, line 221 at r1 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
@ajwerner suggested an approach to read the ts from crdb
_intenal and for debug.zip setting any non-zero ts makes the error go away. We could look into dumping the timestamp in debug.zip as well.
The mvcc timestamp is now available from SQL, thanks to the changes by Rohan. Perhaps you can pick that up.
pkg/sql/sqlbase/descriptor.go, line 238 at r1 (raw file):
Previously, knz (kena) wrote…
I think the behavior (error vs fatal) should be dependent on a boolean. In a running server, this error should ensure the caller does not get a chance to use the descriptor. See my comment above.
I still stand by this point.
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the general remarks! I updated the PR description and linked 2 relevant issues. I will be adding the tests shortly.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)
pkg/sql/sqlbase/descriptor.go, line 221 at r1 (raw file):
Previously, knz (kena) wrote…
The mvcc timestamp is now available from SQL, thanks to the changes by Rohan. Perhaps you can pick that up.
Done.
pkg/sql/sqlbase/descriptor.go, line 238 at r1 (raw file):
Previously, knz (kena) wrote…
I still stand by this point.
This is unrelated to this PR anymore so I'd leave it to @ajwerner to decide if we need to file an issue to change this function prototype.
925c6d8
to
0117928
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty good for a MVP!
I just have a small suggestion below as to API.
I'd also like to renew my suggestion to unit test the Examine function in addition to the command-line test you already have, to test a larger variety of problem cases, and show the diversity of output messages that can be encountered.
I also recommend you rebase to recover from the flaky test that blocks your CI.
Reviewed 2 of 6 files at r1, 5 of 5 files at r3, 1 of 1 files at r4, 5 of 5 files at r5.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/sql/doctor/doctor.go, line 72 at r3 (raw file):
table := sqlbase.NewImmutableTableDescriptor(*sqlbase.TableFromDescriptor(desc, ts)) if int64(table.ID) != row.ID { fmt.Printf("Table %3d: different id in the descriptor: %d", row.ID, table.ID)
I recommend you replace the fmt.Printf
by fmt.Fprintf
and pass an io.Writer
as argument to the function.
This will make your testing easier, as this way you can redirect the output to e.g. a bytes.Buffer and compare strings directly.
9d5ce0b
to
3d86acc
Compare
PTAL, tests for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, just a few nits and this is good to go.
Reviewed 3 of 8 files at r6, 3 of 5 files at r8, 4 of 5 files at r9.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @knz, and @spaskob)
pkg/cli/doctor.go, line 81 at r9 (raw file):
} // RunClusterDoctor runs the doctors tool reading data from a live cluster.
nit: if you don't use the functions in a different package, then also un-export them.
pkg/cli/doctor.go, line 138 at r9 (raw file):
// RunZipDirDoctor runs the doctors tool reading data from a debug zip dir. func RunZipDirDoctor(cmd *cobra.Command, args []string) (retErr error) {
ditto
pkg/sql/doctor/doctor.go, line 56 at r9 (raw file):
// Examine runs a suite of consistency checks over the descriptor table. func Examine(descTable []DescriptorTableRow, verbose bool, stdout io.Writer) (bool, error) {
Our style recommendation is to name the return values when it's not clear what they mean.
Generally, if a function returns a bool and it's not a predicate (ie a function named "IsXXXX" where the bool clearly says yes/no), then it's unclear what the bool is about.
So the "bool" return here should probably be named "ok".
pkg/sql/doctor/doctor.go, line 67 at r9 (raw file):
t := sqlbase.TableFromDescriptor(desc, hlc.Timestamp{}) if t == nil { continue
I recommend a comment here to explain what it means to ignore the table descriptor.
pkg/sql/doctor/doctor.go, line 79 at r9 (raw file):
if err := table.Validate(context.Background(), protoGetter, keys.SystemSQLCodec); err != nil { problemsFound = true fmt.Fprintf(stdout, "Table %3d: %s\n", table.ID, err.Error())
You don't need to call .Error()
on err
. Printf does it for you already.
pkg/sql/sqlbase/descriptor.go, line 239 at r9 (raw file):
// be read. This is a hack we shouldn't need to do. log.Fatalf(context.TODO(), "read table descriptor for %q (%d) without ModificationTime "+ "with zero MVCC timestamp; full descriptor:\n%s", GetDescriptorName(desc), GetDescriptorID(desc), desc.String())
FYI, printf-style formatting already calls .String()
on objects.
The only reason why we sometimes call .String()
explicitly is to avoid the object escaping to the heap, if it wasn't on the heap already.
In this particular case, you're receiving desc
by reference, so it's already on the heap, so no need to call .String()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)
pkg/cli/doctor.go, line 138 at r9 (raw file):
Previously, knz (kena) wrote…
ditto
Done.
pkg/sql/doctor/doctor.go, line 56 at r9 (raw file):
Previously, knz (kena) wrote…
Our style recommendation is to name the return values when it's not clear what they mean.
Generally, if a function returns a bool and it's not a predicate (ie a function named "IsXXXX" where the bool clearly says yes/no), then it's unclear what the bool is about.So the "bool" return here should probably be named "ok".
Done.
pkg/sql/doctor/doctor.go, line 67 at r9 (raw file):
Previously, knz (kena) wrote…
I recommend a comment here to explain what it means to ignore the table descriptor.
Done.
pkg/sql/doctor/doctor.go, line 79 at r9 (raw file):
Previously, knz (kena) wrote…
You don't need to call
.Error()
onerr
. Printf does it for you already.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 5 files at r12, 1 of 5 files at r13.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @knz)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 9 files at r10, 1 of 5 files at r12, 3 of 5 files at r13.
Reviewable status:complete! 2 of 0 LGTMs obtained (waiting on @ajwerner, @knz, and @spaskob)
pkg/sql/doctor/doctor.go, line 11 at r13 (raw file):
// licenses/APL.txt. package doctor
nit: can you add a doc comment
c15f28f
to
6fba86e
Compare
This commit intorduces two new debug commands: `/cockroach debug doctor cluster --url=<cluster_conn_string>` and `/cockroach debug doctor zipdir <debug-dir>` which validate the descriptors read from a live cluster or from an unzipped debug zip directory. The commands print out all descriptor ids that were inspected and the problems with specific ids if found. Fixes cockroachdb#52077. Informs cockroachdb#51153. Release note: none.
We were not testing correctly for the length of the argument list which could result in segfault. Release note: none.
Release note: none.
Release note: none.
bors r+ |
Build succeeded: |
This PR introduces two new debug commands:
/cockroach debug doctor cluster --url=<cluster_conn_string>
and
/cockroach debug doctor zipdir <debug-dir>
which will validate the descriptors of a live cluster or from an unzipped debug
directory respectfully. The commands print out all descriptor ids that were inspected
and the problems with specific ids if found.
Fixes #52077.
Informs #51153.
Release note: none.