-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: require admin role to access node status #67067
Conversation
2422218
to
9f4950b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from Bulk
} | ||
|
||
// ListNodesInternal is a helper function for the benefit of SQL exclusively. | ||
// It skips the privilege check, assuming that SQL is doing privilege checking already. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't SQL call the internal method with a context that has the appropriate user set? This seems to just invite the two methods to diverge in their access controls, or for one of them to be called inappropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "the appropriate user"? All the access by SQL needs to retrieve the full node descriptors, so "the appropriate user" would be admin
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, if any SQL user would inject the admin
user into that context, it may as well not inject anything and not get a privilege check on the other side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I was hoping to have this PR not become a discussion about the proper API design for what's exposed to SQL. That's a problem we know we have already (separate issue - #60584) and I wish this PR to remain backportable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the privilege checking that we assume the SQL layer is doing? The one spot I checked was just doing RequireAdminUser, so it could use the public method.
I mean, if any SQL user would inject the admin user into that context, it may as well not inject anything and not get a privilege check on the other side.
I disagree - I think there's a difference between calling an "internal" no-access-control method and synthesizing a node
context in order to call one that does do access control. In the latter case it's clear at the call site that you're crossing security boundaries and need to restrict what gets returned to your caller. In the former case it's less clear and it would be easier to simply return the sensitive data verbatim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at the situation from my perspective:
- the current code before this PR already does not do any authorization.
- This PR is making the current situation better by forcing authorization for external accesses via HTTP.
- You're now advocated for a richer internal API beyond what this PR is doing, in a way that is adding complexity which we aim to resolve when we fix sql: improper dependency on the RPC servers from SQL #60584.
Is this extension strictly needed to solve the missing http authorization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking is that you've added new surface area (the new ListNodesInternal method) that is always risky to use; every call site of this method must be audited to make sure it handles the results appropriately. The smallest change to fix the missing http authorization check is to simply add the missing check. This may break some things that were relying on the lack of auth, so we fix those (and only those) by making them explicitly elevate privileges.
Maybe it's more expedient to keep the status quo of unauthenticated access for non-HTTP entry points, but I'm worried that it sets us up for only partially solving the problem and we'll have to follow up by patching another one of these later. I'll OK this patch as long as you've manually verified each of the call sites of ListNodesInternal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's enumerate the cases of how the Nodes()
function could be called previously:
- as a RPC call via gRPC. For this, all the RPC methods are already authenticated; also they already require a
node
principal. This is still true today. - as a HTTP API call. For this, all the HTTP endpoints are authenticated but this one specifically accepted any SQL user, not just those with
admin
role. This is the bug being fixed. - as direct Go calls from other packages inside the process. These were not RPCs, nor meant to be subject to any authentication/authorization (and they are performed with process-level privilege, which is the highest they can be).
This PR proposes to fix the problem at point 2, without introducing additional complexity at point 3.
In fact, coming back to your sentence:
you've added new surface area (the new ListNodesInternal method) that is always risky to use; every call site of this method must be audited to make sure it handles the results appropriately
That is incorrect. I merely renamed the existing surface area "Nodes()
as an internal Go method" to ListNodesInternal()
. It's the same that was there before.
I'd like to point out that:
- the PR does not add a new RPC endpoint for the new function
ListNodesInternal
. It is not listed in a .proto file and is only available as Go function to call from other packages. - we do not know of any privilege issues stemming today from direct Go calls of RPC handler methods from other packages without going through gRPC and its authentication system. Are there any? Is that a problem we need to fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as direct Go calls from other packages inside the process. These were not RPCs, nor meant to be subject to any authentication/authorization (and they are performed with process-level privilege, which is the highest they can be).
I don't agree that they were not meant to be subject to any authentication/authorization. Historically we have not done a good job of this, but we would ideally have privilege checks as close to the code that accesses sensitive data as possible.
we do not know of any privilege issues stemming today from direct Go calls of RPC handler methods from other packages without going through gRPC and its authentication system. Are there any? Is that a problem we need to fix?
We don't know of any other issues with ListNodesInternal today. But now that we've identified that this data is sensitive and we've missed auth checks on other paths, it's worth looking at the other ones to make sure they're covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have filed a follow-up issue to do this investigation: #67938
do you think this investigation should block this PR?
As of today, I have checked that only the nodes endpoint is used by SQL (and requires node privilege). I have not checked whether the KV endpoints or other endpoint handlers are used anywhere.
(My opinion remains that this investigation is not on the critical path to fixing the current vulnerability.)
If you want something specific to happen, please spell out the specific actions you'd like me to take.
Release note (security update): The node status retrieval endpoints over HTTP (`/_status/nodes`, `/_status/nodes/<N>` and the web UI `/#/reports/nodes`) have been updated to require the `admin` role from the requesting user. This ensures that operational details such as network addresses and command-line flags do not leak to unprivileged users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
TFYR bors r=bdarnell |
Build failed (retrying...): |
Build succeeded: |
Release note (security update): The node status retrieval endpoints
over HTTP (
/_status/nodes
,/_status/nodes/<N>
and the web UI/#/reports/nodes
) have been updated to require theadmin
role fromthe requesting user. This ensures that operational details such as
network addresses and command-line flags do not leak to unprivileged
users.