Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: require admin role to access node status #67067

Merged
merged 1 commit into from
Jul 22, 2021

Conversation

knz
Copy link
Contributor

@knz knz commented Jun 30, 2021

Release note (security update): The node status retrieval endpoints
over HTTP (/_status/nodes, /_status/nodes/<N> and the web UI
/#/reports/nodes) have been updated to require the admin role from
the requesting user. This ensures that operational details such as
network addresses and command-line flags do not leak to unprivileged
users.

@knz knz requested review from bdarnell, itsbilal and aaron-crl June 30, 2021 13:20
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link

@nihalpednekar nihalpednekar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from Bulk

}

// ListNodesInternal is a helper function for the benefit of SQL exclusively.
// It skips the privilege check, assuming that SQL is doing privilege checking already.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't SQL call the internal method with a context that has the appropriate user set? This seems to just invite the two methods to diverge in their access controls, or for one of them to be called inappropriately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "the appropriate user"? All the access by SQL needs to retrieve the full node descriptors, so "the appropriate user" would be admin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, if any SQL user would inject the admin user into that context, it may as well not inject anything and not get a privilege check on the other side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I was hoping to have this PR not become a discussion about the proper API design for what's exposed to SQL. That's a problem we know we have already (separate issue - #60584) and I wish this PR to remain backportable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the privilege checking that we assume the SQL layer is doing? The one spot I checked was just doing RequireAdminUser, so it could use the public method.

I mean, if any SQL user would inject the admin user into that context, it may as well not inject anything and not get a privilege check on the other side.

I disagree - I think there's a difference between calling an "internal" no-access-control method and synthesizing a node context in order to call one that does do access control. In the latter case it's clear at the call site that you're crossing security boundaries and need to restrict what gets returned to your caller. In the former case it's less clear and it would be easier to simply return the sensitive data verbatim.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at the situation from my perspective:

  • the current code before this PR already does not do any authorization.
  • This PR is making the current situation better by forcing authorization for external accesses via HTTP.
  • You're now advocated for a richer internal API beyond what this PR is doing, in a way that is adding complexity which we aim to resolve when we fix sql: improper dependency on the RPC servers from SQL #60584.

Is this extension strictly needed to solve the missing http authorization?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking is that you've added new surface area (the new ListNodesInternal method) that is always risky to use; every call site of this method must be audited to make sure it handles the results appropriately. The smallest change to fix the missing http authorization check is to simply add the missing check. This may break some things that were relying on the lack of auth, so we fix those (and only those) by making them explicitly elevate privileges.

Maybe it's more expedient to keep the status quo of unauthenticated access for non-HTTP entry points, but I'm worried that it sets us up for only partially solving the problem and we'll have to follow up by patching another one of these later. I'll OK this patch as long as you've manually verified each of the call sites of ListNodesInternal.

Copy link
Contributor Author

@knz knz Jul 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's enumerate the cases of how the Nodes() function could be called previously:

  1. as a RPC call via gRPC. For this, all the RPC methods are already authenticated; also they already require a node principal. This is still true today.
  2. as a HTTP API call. For this, all the HTTP endpoints are authenticated but this one specifically accepted any SQL user, not just those with admin role. This is the bug being fixed.
  3. as direct Go calls from other packages inside the process. These were not RPCs, nor meant to be subject to any authentication/authorization (and they are performed with process-level privilege, which is the highest they can be).

This PR proposes to fix the problem at point 2, without introducing additional complexity at point 3.

In fact, coming back to your sentence:

you've added new surface area (the new ListNodesInternal method) that is always risky to use; every call site of this method must be audited to make sure it handles the results appropriately

That is incorrect. I merely renamed the existing surface area "Nodes() as an internal Go method" to ListNodesInternal(). It's the same that was there before.

I'd like to point out that:

  • the PR does not add a new RPC endpoint for the new function ListNodesInternal. It is not listed in a .proto file and is only available as Go function to call from other packages.
  • we do not know of any privilege issues stemming today from direct Go calls of RPC handler methods from other packages without going through gRPC and its authentication system. Are there any? Is that a problem we need to fix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as direct Go calls from other packages inside the process. These were not RPCs, nor meant to be subject to any authentication/authorization (and they are performed with process-level privilege, which is the highest they can be).

I don't agree that they were not meant to be subject to any authentication/authorization. Historically we have not done a good job of this, but we would ideally have privilege checks as close to the code that accesses sensitive data as possible.

we do not know of any privilege issues stemming today from direct Go calls of RPC handler methods from other packages without going through gRPC and its authentication system. Are there any? Is that a problem we need to fix?

We don't know of any other issues with ListNodesInternal today. But now that we've identified that this data is sensitive and we've missed auth checks on other paths, it's worth looking at the other ones to make sure they're covered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have filed a follow-up issue to do this investigation: #67938

do you think this investigation should block this PR?
As of today, I have checked that only the nodes endpoint is used by SQL (and requires node privilege). I have not checked whether the KV endpoints or other endpoint handlers are used anywhere.

(My opinion remains that this investigation is not on the critical path to fixing the current vulnerability.)

If you want something specific to happen, please spell out the specific actions you'd like me to take.

Release note (security update): The node status retrieval endpoints
over HTTP (`/_status/nodes`, `/_status/nodes/<N>` and the web UI
`/#/reports/nodes`) have been updated to require the `admin` role from
the requesting user. This ensures that operational details such as
network addresses and command-line flags do not leak to unprivileged
users.
@knz knz force-pushed the 20210630-status branch from 9f4950b to 3618a80 Compare July 22, 2021 16:30
Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@knz
Copy link
Contributor Author

knz commented Jul 22, 2021

TFYR

bors r=bdarnell

@craig
Copy link
Contributor

craig bot commented Jul 22, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jul 22, 2021

Build succeeded:

@craig craig bot merged commit 0119b2b into cockroachdb:master Jul 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants