Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rate limit all unauthenticated HTTP endpoints #24623

Merged
merged 3 commits into from
May 1, 2023
Merged

Conversation

jentfoo
Copy link
Contributor

@jentfoo jentfoo commented Apr 14, 2023

This PR is an extension to what was done in PR #172. This PR is designed to fix #4330 and also fix https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use. There are two considered strategies to make sure this does not become impacting:

  1. Adjust the rate limiter so the rate limit becomes endpoint specific. This would avoid the need to consider how activity on one endpoint effects another.
  2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option 2. While opt 1 has advantages, particularly as endpoints and new use cases are added. Option 2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.

@jentfoo jentfoo self-assigned this Apr 14, 2023
@github-actions github-actions bot requested a review from klizhentas April 14, 2023 19:28
@@ -368,9 +368,9 @@ const (
LimiterPasswordlessPeriod = 1 * time.Minute
// LimiterPasswordlessAverage is the default average for passwordless
// limiters.
LimiterPasswordlessAverage = 10
LimiterPasswordlessAverage = 20
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values are no longer specific to passwordless interactions, are they?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK they are only used in passwordless interactions, but there may be something I am missing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code, I think Zac is right, so the Passwordless part should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 2f417cc

Let me know if another naming strikes you as better

@@ -521,23 +521,23 @@ func (h *Handler) bindDefaultEndpoints() {
// endpoint returns the default authentication method and configuration that
// the server supports. the /webapi/ping/:connector endpoint can be used to
// query the authentication configuration for a specific connector.
h.GET("/webapi/ping", httplib.MakeHandler(h.ping))
h.GET("/webapi/ping/:connector", httplib.MakeHandler(h.pingWithConnector))
h.GET("/webapi/ping", h.WithLimiter(h.ping))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be a little bit concerned about limiting the ping endpoint, I think it may be used as part of the node joining or initial connection process. (we should confirm whether this is the case or not)

For example, suppose a customer has thousands of nodes in some private network with SNAT. It will appear as if all of these nodes are connecting from the same source IP and could trip the rate limiter and prevent nodes from joining or reconnecting after an upgrade.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#4330 specifically calls out the file descriptors that are opened from ping. I understand the concern though.

In such a condition, would the reconnecting nodes retry until they are able to eventually get a rate allowed request?

Do you think any rate limit could be designed around ping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zmb3 As an alternative solution, could the nodes start to send their ping request authenticated? If so we could still make this endpoint "optionally unauthenticated", but let the rate limiter ignore requests which provide valid authentication.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for adding an authenticated version of endpoints commonly used by authenticated clients, like find, and ping. Even better, we could use the gRPC version of these endpoints when we have an authenticated gRPC client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the ability to only rate limit unauthenticated requests in this commit: 289b611

As mentioned in the message luckily ping was already providing authentication, so this logic worked pretty straight forward out of the box. I added this to other endpoints which might be good candidates to consider, although I did not verify they were providing auth like ping was.

Please provide feedback on this change, thank you!

@zmb3
Copy link
Collaborator

zmb3 commented Apr 14, 2023

A change like this should be reserved for a major release and run through the full test plan including load testing and performance tests.

This process will begin for Teleport 13 as early as Monday, which means this might have to wait until Teleport 14.

@jentfoo
Copy link
Contributor Author

jentfoo commented Apr 14, 2023

I agree with the test plan @zmb3. I wont merge this to master until 13.0.0 has been released

Copy link
Contributor

@Joerger Joerger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add rate limiters to the enterprise only endpoints? For example, it looks like we need one here - https://github.com/gravitational/teleport.e/blob/53ef3ae994b5968b980fdff9704dae9baeb56ba4/lib/web/plugin.go#L163.

@@ -368,9 +368,9 @@ const (
LimiterPasswordlessPeriod = 1 * time.Minute
// LimiterPasswordlessAverage is the default average for passwordless
// limiters.
LimiterPasswordlessAverage = 10
LimiterPasswordlessAverage = 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code, I think Zac is right, so the Passwordless part should be removed.

@@ -521,23 +521,23 @@ func (h *Handler) bindDefaultEndpoints() {
// endpoint returns the default authentication method and configuration that
// the server supports. the /webapi/ping/:connector endpoint can be used to
// query the authentication configuration for a specific connector.
h.GET("/webapi/ping", httplib.MakeHandler(h.ping))
h.GET("/webapi/ping/:connector", httplib.MakeHandler(h.pingWithConnector))
h.GET("/webapi/ping", h.WithLimiter(h.ping))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for adding an authenticated version of endpoints commonly used by authenticated clients, like find, and ping. Even better, we could use the gRPC version of these endpoints when we have an authenticated gRPC client.

@jentfoo jentfoo force-pushed the jent/rate_limiting branch from 454925b to 289b611 Compare April 18, 2023 20:00
@jentfoo
Copy link
Contributor Author

jentfoo commented Apr 18, 2023

Should we also add rate limiters to the enterprise only endpoints?

I will submit a PR to update to use the new WithUnauthenticatedLimiter once we merge this PR. Thank you for spotting @Joerger!

@jentfoo
Copy link
Contributor Author

jentfoo commented Apr 24, 2023

@zmb3 Would you mind reviewing this PR?

@@ -521,23 +521,23 @@ func (h *Handler) bindDefaultEndpoints() {
// endpoint returns the default authentication method and configuration that
// the server supports. the /webapi/ping/:connector endpoint can be used to
// query the authentication configuration for a specific connector.
h.GET("/webapi/ping", httplib.MakeHandler(h.ping))
h.GET("/webapi/ping/:connector", httplib.MakeHandler(h.pingWithConnector))
h.GET("/webapi/ping", h.WithUnauthenticatedLimiter(h.ping))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still a little nervous about ping. tsh login hits this endpoint, so with the current settings, no more than 40 users can log in (from the same IP) during a 1 minute window.

There are several ways (SNAT, VPN gateway, etc) that it could appear that all users are connecting from the same IP.

What do you think about either using more generous limits for this endpoint, or starting with some metrics/logging and otherwise leaving it unlimited so that we can evaluate the impact of such a change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about either using more generous limits for this endpoint, or starting with some metrics/logging and otherwise leaving it unlimited so that we can evaluate the impact of such a change?

I believe this is one of the more important endpoints to rate limit. What rate limit would make you comfortable? We could potentially keep the average at 20, but increase the burst to 100 or 120.

The goal in having this land right after v13 is cut is so that we can have as long as possible period to test and witness any unexpected interactions from this change.

Let me know your thoughts, thank you!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what limits would make me more comfortable, because I don't have any data to inform me as to what's reasonable and expected and what's abnormal.

If we put a prometheus metric in for requests/s to this endpoint (grouped by remote IP) we wouldn't have to collect data very long to get a good idea of what's normal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zmb3 and @Joerger Can you please review the latest commit: 3d31f86

There I introduce a high rate limiter for these select endpoints which are only CPU bound. After discussions with @zmb3 I believe this to be the safest balance, but please give me your candid thoughts. Thank you!

@jentfoo jentfoo changed the title Rate limit all unauthenticated endpoints Rate limit all unauthenticated HTTP endpoints Apr 25, 2023
@jentfoo jentfoo force-pushed the jent/rate_limiting branch from 3b0e68e to 10c6968 Compare April 25, 2023 15:09
jentfoo added 2 commits April 27, 2023 10:32
This commit is an extension to what was done in #172. And is designed to fix #4330 and https://github.com/gravitational/teleport-private/issues/403.

Rather than audit endpoints and choose what endpoints should be rate limited, this commit proposes that for safety and reduced cognitive load, all unauthenticated endpoints become rate limited.

The primary concern in this type of change would be if our rate limit becomes too aggressive for general use.  There are two considered strategies to make sure this does not become impacting:
1. Adjust the rate limiter so the rate limit becomes endpoint specific.  This would avoid the need to consider how activity on one endpoint effects another.
2. Accept that rate limit interactions are possible and instead ensure rate limits are high enough to avoid this concern.

This commit chooses option #2.  While #1 has advantages, particularly as endpoints and new use cases are added.  #2 provides the strictest and safest rate limits.

Our rate limits were configured to:
period: 1 min
avg rate: 10
burst rate: 20

In order to build a safety buffer with option #2 those allowed rates were doubled.

Additionally the ability to avoid rate limits by authenticating your request (even if the endpoint is otherwise unauthenticated) was added.  This is particularly useful for the `ping` endpoint which may have high levels of activity on large clusters, but which has a portion of that activity over authenticated requests.
This new `High` rate limit is designed for endpoints which are only CPU bound (and thus don't have as significant of DoS risks).
Initially this was motivated for `ping` and `find` due to the concern that these endpoints are used unauthenticated at login, and potential NAT's may result in very high rates from single egress IP's.
In my testing on my laptop, all of these endpoints can easily get 640/req/sec on a single core within a VM.  Setting the maximum of 480 burst and 120 continuous should both ensure that no single source utilizes all the CPU, as well as build in additional safety margins while providing a layer of protection.
@jentfoo jentfoo force-pushed the jent/rate_limiting branch from 10c6968 to 3d31f86 Compare April 27, 2023 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rate-limit proxy web API endpoints
3 participants