-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rate limit all unauthenticated HTTP endpoints #24623
Conversation
lib/defaults/defaults.go
Outdated
@@ -368,9 +368,9 @@ const ( | |||
LimiterPasswordlessPeriod = 1 * time.Minute | |||
// LimiterPasswordlessAverage is the default average for passwordless | |||
// limiters. | |||
LimiterPasswordlessAverage = 10 | |||
LimiterPasswordlessAverage = 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These values are no longer specific to passwordless interactions, are they?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK they are only used in passwordless interactions, but there may be something I am missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code, I think Zac is right, so the Passwordless
part should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in 2f417cc
Let me know if another naming strikes you as better
lib/web/apiserver.go
Outdated
@@ -521,23 +521,23 @@ func (h *Handler) bindDefaultEndpoints() { | |||
// endpoint returns the default authentication method and configuration that | |||
// the server supports. the /webapi/ping/:connector endpoint can be used to | |||
// query the authentication configuration for a specific connector. | |||
h.GET("/webapi/ping", httplib.MakeHandler(h.ping)) | |||
h.GET("/webapi/ping/:connector", httplib.MakeHandler(h.pingWithConnector)) | |||
h.GET("/webapi/ping", h.WithLimiter(h.ping)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be a little bit concerned about limiting the ping endpoint, I think it may be used as part of the node joining or initial connection process. (we should confirm whether this is the case or not)
For example, suppose a customer has thousands of nodes in some private network with SNAT. It will appear as if all of these nodes are connecting from the same source IP and could trip the rate limiter and prevent nodes from joining or reconnecting after an upgrade.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#4330 specifically calls out the file descriptors that are opened from ping. I understand the concern though.
In such a condition, would the reconnecting nodes retry until they are able to eventually get a rate allowed request?
Do you think any rate limit could be designed around ping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zmb3 As an alternative solution, could the nodes start to send their ping
request authenticated? If so we could still make this endpoint "optionally unauthenticated", but let the rate limiter ignore requests which provide valid authentication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for adding an authenticated version of endpoints commonly used by authenticated clients, like find
, and ping
. Even better, we could use the gRPC version of these endpoints when we have an authenticated gRPC client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the ability to only rate limit unauthenticated requests in this commit: 289b611
As mentioned in the message luckily ping
was already providing authentication, so this logic worked pretty straight forward out of the box. I added this to other endpoints which might be good candidates to consider, although I did not verify they were providing auth like ping
was.
Please provide feedback on this change, thank you!
A change like this should be reserved for a major release and run through the full test plan including load testing and performance tests. This process will begin for Teleport 13 as early as Monday, which means this might have to wait until Teleport 14. |
I agree with the test plan @zmb3. I wont merge this to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also add rate limiters to the enterprise only endpoints? For example, it looks like we need one here - https://github.com/gravitational/teleport.e/blob/53ef3ae994b5968b980fdff9704dae9baeb56ba4/lib/web/plugin.go#L163.
lib/defaults/defaults.go
Outdated
@@ -368,9 +368,9 @@ const ( | |||
LimiterPasswordlessPeriod = 1 * time.Minute | |||
// LimiterPasswordlessAverage is the default average for passwordless | |||
// limiters. | |||
LimiterPasswordlessAverage = 10 | |||
LimiterPasswordlessAverage = 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code, I think Zac is right, so the Passwordless
part should be removed.
lib/web/apiserver.go
Outdated
@@ -521,23 +521,23 @@ func (h *Handler) bindDefaultEndpoints() { | |||
// endpoint returns the default authentication method and configuration that | |||
// the server supports. the /webapi/ping/:connector endpoint can be used to | |||
// query the authentication configuration for a specific connector. | |||
h.GET("/webapi/ping", httplib.MakeHandler(h.ping)) | |||
h.GET("/webapi/ping/:connector", httplib.MakeHandler(h.pingWithConnector)) | |||
h.GET("/webapi/ping", h.WithLimiter(h.ping)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for adding an authenticated version of endpoints commonly used by authenticated clients, like find
, and ping
. Even better, we could use the gRPC version of these endpoints when we have an authenticated gRPC client.
454925b
to
289b611
Compare
I will submit a PR to update to use the new |
@zmb3 Would you mind reviewing this PR? |
lib/web/apiserver.go
Outdated
@@ -521,23 +521,23 @@ func (h *Handler) bindDefaultEndpoints() { | |||
// endpoint returns the default authentication method and configuration that | |||
// the server supports. the /webapi/ping/:connector endpoint can be used to | |||
// query the authentication configuration for a specific connector. | |||
h.GET("/webapi/ping", httplib.MakeHandler(h.ping)) | |||
h.GET("/webapi/ping/:connector", httplib.MakeHandler(h.pingWithConnector)) | |||
h.GET("/webapi/ping", h.WithUnauthenticatedLimiter(h.ping)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still a little nervous about ping. tsh login
hits this endpoint, so with the current settings, no more than 40 users can log in (from the same IP) during a 1 minute window.
There are several ways (SNAT, VPN gateway, etc) that it could appear that all users are connecting from the same IP.
What do you think about either using more generous limits for this endpoint, or starting with some metrics/logging and otherwise leaving it unlimited so that we can evaluate the impact of such a change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about either using more generous limits for this endpoint, or starting with some metrics/logging and otherwise leaving it unlimited so that we can evaluate the impact of such a change?
I believe this is one of the more important endpoints to rate limit. What rate limit would make you comfortable? We could potentially keep the average at 20, but increase the burst to 100 or 120.
The goal in having this land right after v13 is cut is so that we can have as long as possible period to test and witness any unexpected interactions from this change.
Let me know your thoughts, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what limits would make me more comfortable, because I don't have any data to inform me as to what's reasonable and expected and what's abnormal.
If we put a prometheus metric in for requests/s to this endpoint (grouped by remote IP) we wouldn't have to collect data very long to get a good idea of what's normal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3b0e68e
to
10c6968
Compare
This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403. Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited. The primary concern in this type of change would be if our rate limit becomes too aggressive for general use. There are two considered strategies to make sure this does not become impacting: 1. Adjust the rate limiter so the rate limit becomes endpoint specific. This would avoid the need to consider how activity on one endpoint effects another. 2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern. This commit chooses option #2. While #1 has advantages, particularly as endpoints and new use cases are added. #2 provides the strictest and safest rate limits. Our rate limits were configured to: period: 1 min avg rate: 10 burst rate: 20 In order to build a safety buffer with option #2 those allowed rates were doubled. Additionally the ability to avoid rate limits by authenticating your request (even if the endpoint is otherwise unauthenticated) was added. This is particularly useful for the `ping` endpoint which may have high levels of activity on large clusters, but which has a portion of that activity over authenticated requests.
This new `High` rate limit is designed for endpoints which are only CPU bound (and thus don't have as significant of DoS risks). Initially this was motivated for `ping` and `find` due to the concern that these endpoints are used unauthenticated at login, and potential NAT's may result in very high rates from single egress IP's. In my testing on my laptop, all of these endpoints can easily get 640/req/sec on a single core within a VM. Setting the maximum of 480 burst and 120 continuous should both ensure that no single source utilizes all the CPU, as well as build in additional safety margins while providing a layer of protection.
10c6968
to
3d31f86
Compare
This PR is an extension to what was done in PR #172. This PR is designed to fix #4330 and also fix https://github.com/gravitational/teleport-private/issues/403.
Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.
The primary concern in this type of change would be if our rate limit becomes too aggressive for general use. There are two considered strategies to make sure this does not become impacting:
This commit chooses option 2. While
opt 1
has advantages, particularly as endpoints and new use cases are added. Option 2 provides the strictest and safest rate limits.Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20
In order to build a safety buffer with option #2 those allowed rates were doubled.