Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: add debug doctor command #52622

Merged
merged 4 commits into from
Aug 20, 2020
Merged

cli: add debug doctor command #52622

merged 4 commits into from
Aug 20, 2020

Conversation

spaskob
Copy link
Contributor

@spaskob spaskob commented Aug 11, 2020

This PR introduces two new debug commands:
/cockroach debug doctor cluster --url=<cluster_conn_string>
and
/cockroach debug doctor zipdir <debug-dir>

which will validate the descriptors of a live cluster or from an unzipped debug
directory respectfully. The commands print out all descriptor ids that were inspected
and the problems with specific ids if found.

Fixes #52077.
Informs #51153.

Release note: none.

@spaskob spaskob requested a review from ajwerner August 11, 2020 11:41
@spaskob spaskob requested a review from a team as a code owner August 11, 2020 11:41
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/sql/sqlbase/descriptor.go, line 221 at r1 (raw file):

		// CreateAsOfTime that it is now set.
		if t.Table.Adding() && t.Table.IsAs() && t.Table.CreateAsOfTime.IsEmpty() {
			log.Fatalf(context.TODO(), "table descriptor for %q (%d.%d) is in the "+

I wonder if it would make sense to error out with an assertion error instead of stopping the server in this condition.


pkg/sql/sqlbase/descriptor.go, line 238 at r1 (raw file):

		// are rendered in some other downstream setting we expect the timestamp to
		// be read. This is a hack we shouldn't need to do.
		log.Errorf(context.TODO(), "read table descriptor for %q (%d) without ModificationTime "+

I think the behavior (error vs fatal) should be dependent on a boolean. In a running server, this error should ensure the caller does not get a chance to use the descriptor. See my comment above.

@spaskob spaskob changed the title (WIP)cli: add debug doctor command cli: add debug doctor command Aug 13, 2020
Copy link
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @knz, and @spaskob)


pkg/sql/sqlbase/descriptor.go, line 221 at r1 (raw file):

Previously, knz (kena) wrote…

I wonder if it would make sense to error out with an assertion error instead of stopping the server in this condition.

@ajwerner suggested an approach to read the ts from crdb
_intenal and for debug.zip setting any non-zero ts makes the error go away. We could look into dumping the timestamp in debug.zip as well.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few general thoughts:

  • you don't need to invest too much into polish since this is a debug command.
  • however, you probably do want some tests. So I recommend some tests that injects invalid descriptors and verifies that the function finds the errors
  • the commit message (and PR description) should be scoped down to what this PR actually does. It's OK to talk about "future plans" in an issue (which is not filed yet as far as I can see, so you should file it and refer to it from the PR). It's not OK to talk about future plans in the commit message. The commit message should describe objectively what's contained in the commit.
  • the release note is IMHO premature since the work is not complete yet. If you do keep it, you need to start the release note with a description of what tool was introduced. Remember, the release note is pasted into docs without any reference to the PR or any title.

Reviewed 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/sql/sqlbase/descriptor.go, line 221 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

@ajwerner suggested an approach to read the ts from crdb
_intenal and for debug.zip setting any non-zero ts makes the error go away. We could look into dumping the timestamp in debug.zip as well.

The mvcc timestamp is now available from SQL, thanks to the changes by Rohan. Perhaps you can pick that up.


pkg/sql/sqlbase/descriptor.go, line 238 at r1 (raw file):

Previously, knz (kena) wrote…

I think the behavior (error vs fatal) should be dependent on a boolean. In a running server, this error should ensure the caller does not get a chance to use the descriptor. See my comment above.

I still stand by this point.

@blathers-crl
Copy link

blathers-crl bot commented Aug 13, 2020

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added the O-community Originated from the community label Aug 13, 2020
Copy link
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the general remarks! I updated the PR description and linked 2 relevant issues. I will be adding the tests shortly.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)


pkg/sql/sqlbase/descriptor.go, line 221 at r1 (raw file):

Previously, knz (kena) wrote…

The mvcc timestamp is now available from SQL, thanks to the changes by Rohan. Perhaps you can pick that up.

Done.


pkg/sql/sqlbase/descriptor.go, line 238 at r1 (raw file):

Previously, knz (kena) wrote…

I still stand by this point.

This is unrelated to this PR anymore so I'd leave it to @ajwerner to decide if we need to file an issue to change this function prototype.

@spaskob spaskob force-pushed the tool branch 3 times, most recently from 925c6d8 to 0117928 Compare August 18, 2020 10:46
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty good for a MVP!
I just have a small suggestion below as to API.

I'd also like to renew my suggestion to unit test the Examine function in addition to the command-line test you already have, to test a larger variety of problem cases, and show the diversity of output messages that can be encountered.

I also recommend you rebase to recover from the flaky test that blocks your CI.

Reviewed 2 of 6 files at r1, 5 of 5 files at r3, 1 of 1 files at r4, 5 of 5 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/sql/doctor/doctor.go, line 72 at r3 (raw file):

		table := sqlbase.NewImmutableTableDescriptor(*sqlbase.TableFromDescriptor(desc, ts))
		if int64(table.ID) != row.ID {
			fmt.Printf("Table %3d: different id in the descriptor: %d", row.ID, table.ID)

I recommend you replace the fmt.Printf by fmt.Fprintf and pass an io.Writer as argument to the function.

This will make your testing easier, as this way you can redirect the output to e.g. a bytes.Buffer and compare strings directly.

@spaskob spaskob force-pushed the tool branch 4 times, most recently from 9d5ce0b to 3d86acc Compare August 19, 2020 07:01
@spaskob
Copy link
Contributor Author

spaskob commented Aug 19, 2020

PTAL, tests for doctor.Examine added.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, just a few nits and this is good to go.

Reviewed 3 of 8 files at r6, 3 of 5 files at r8, 4 of 5 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @knz, and @spaskob)


pkg/cli/doctor.go, line 81 at r9 (raw file):

}

// RunClusterDoctor runs the doctors tool reading data from a live cluster.

nit: if you don't use the functions in a different package, then also un-export them.


pkg/cli/doctor.go, line 138 at r9 (raw file):

// RunZipDirDoctor runs the doctors tool reading data from a debug zip dir.
func RunZipDirDoctor(cmd *cobra.Command, args []string) (retErr error) {

ditto


pkg/sql/doctor/doctor.go, line 56 at r9 (raw file):

// Examine runs a suite of consistency checks over the descriptor table.
func Examine(descTable []DescriptorTableRow, verbose bool, stdout io.Writer) (bool, error) {

Our style recommendation is to name the return values when it's not clear what they mean.
Generally, if a function returns a bool and it's not a predicate (ie a function named "IsXXXX" where the bool clearly says yes/no), then it's unclear what the bool is about.

So the "bool" return here should probably be named "ok".


pkg/sql/doctor/doctor.go, line 67 at r9 (raw file):

		t := sqlbase.TableFromDescriptor(desc, hlc.Timestamp{})
		if t == nil {
			continue

I recommend a comment here to explain what it means to ignore the table descriptor.


pkg/sql/doctor/doctor.go, line 79 at r9 (raw file):

		if err := table.Validate(context.Background(), protoGetter, keys.SystemSQLCodec); err != nil {
			problemsFound = true
			fmt.Fprintf(stdout, "Table %3d: %s\n", table.ID, err.Error())

You don't need to call .Error() on err. Printf does it for you already.


pkg/sql/sqlbase/descriptor.go, line 239 at r9 (raw file):

		// be read. This is a hack we shouldn't need to do.
		log.Fatalf(context.TODO(), "read table descriptor for %q (%d) without ModificationTime "+
			"with zero MVCC timestamp; full descriptor:\n%s", GetDescriptorName(desc), GetDescriptorID(desc), desc.String())

FYI, printf-style formatting already calls .String() on objects.

The only reason why we sometimes call .String() explicitly is to avoid the object escaping to the heap, if it wasn't on the heap already.

In this particular case, you're receiving desc by reference, so it's already on the heap, so no need to call .String().

Copy link
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)


pkg/cli/doctor.go, line 138 at r9 (raw file):

Previously, knz (kena) wrote…

ditto

Done.


pkg/sql/doctor/doctor.go, line 56 at r9 (raw file):

Previously, knz (kena) wrote…

Our style recommendation is to name the return values when it's not clear what they mean.
Generally, if a function returns a bool and it's not a predicate (ie a function named "IsXXXX" where the bool clearly says yes/no), then it's unclear what the bool is about.

So the "bool" return here should probably be named "ok".

Done.


pkg/sql/doctor/doctor.go, line 67 at r9 (raw file):

Previously, knz (kena) wrote…

I recommend a comment here to explain what it means to ignore the table descriptor.

Done.


pkg/sql/doctor/doctor.go, line 79 at r9 (raw file):

Previously, knz (kena) wrote…

You don't need to call .Error() on err. Printf does it for you already.

Done.

Copy link
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: nice :)

Reviewed 1 of 5 files at r12, 1 of 5 files at r13.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @knz)

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 9 files at r10, 1 of 5 files at r12, 3 of 5 files at r13.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner, @knz, and @spaskob)


pkg/sql/doctor/doctor.go, line 11 at r13 (raw file):

// licenses/APL.txt.

package doctor

nit: can you add a doc comment

@spaskob spaskob force-pushed the tool branch 3 times, most recently from c15f28f to 6fba86e Compare August 20, 2020 08:11
Spas Bojanov added 4 commits August 20, 2020 11:43
This commit intorduces two new debug commands:

`/cockroach debug doctor cluster --url=<cluster_conn_string>`
and
`/cockroach debug doctor zipdir <debug-dir>`

which validate the descriptors read from a live cluster or from an
unzipped debug zip directory. The commands print out all descriptor ids
that were inspected and the problems with specific ids if found.

Fixes cockroachdb#52077.
Informs cockroachdb#51153.

Release note: none.
We were not testing correctly for the length of
the argument list which could result in segfault.

Release note: none.
Release note: none.
Release note: none.
@spaskob
Copy link
Contributor Author

spaskob commented Aug 20, 2020

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 20, 2020

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-community Originated from the community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: schema verification tool over debug.zip
4 participants