Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache event fanout & reversetunnel improvements #3536

Merged
merged 1 commit into from
Apr 23, 2020

Conversation

fspmarshall
Copy link
Contributor

@fspmarshall fspmarshall commented Apr 7, 2020

The PR includes multiple fixes intended to improve reliability of the caching and reversetunnel systems:

  • cache now perforams in-memory fanout of events, eliminating
    spurious event generation due to cache init/reset.

  • removed old unused logic from reversetunnel agents.

  • seekpool now directly controls active agent count via new
    semaphore-like lease system.

  • seekpool functionality simplified and split into two separate components; a "workpool" construct which functions like a per-cluster semaphore with variable targets, and a "tracker" which manages a record of which proxies might exist and which proxies are currently held by active agents.

  • add per-cluster jittered backoff to agent connection attempts to
    reduce "thundering herd" effect.

  • removed per-proxy and per-cluster backoff constructs in favor of a single jittered backoff at the AgentPool level (easier to maintain and test).

  • improved reversetunnel logging.

  • improved LB usage in tests.

Fixes #3468

Copy link
Contributor

@awly awly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some style commends, since I'm not very familiar with the code yet

Copy link
Contributor

@klizhentas klizhentas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first pass on the first half, will do another pass shortly after

@fspmarshall fspmarshall force-pushed the fspmarshall/iot-debug-master branch from 1d99284 to 9ecf9a5 Compare April 15, 2020 16:07
@awly
Copy link
Contributor

awly commented Apr 15, 2020

https://github.com/golang/go/wiki/CodeReviewComments#interfaces
Please move all the interfaces to callers instead of implementation sites, return concrete types from constructors.

Copy link
Contributor

@russjones russjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review, will continue tomorrow.

if err != nil && err != io.EOF {
p.log.Warnf("Proxy transport failed: %v %T.", trace.DebugReport(err), err)
}
case <-errorCh:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like before we were reporting errors other than io.EOF. What errors were you seeing here other than io.EOF that were not relevant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed because this was just filling the logs with errors related to "use of closed network connection".

if count < 1 {
count = 1
}
p.wp.Set(key, uint64(count))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't the count be picked up in the tick function and won't it be more accurate because it will have reduce count by the number of proxies that are expired?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expiry is checked on a pretty long interval (30 seconds, though it is configurable). If we only updated counts at that point, the AgentPool would be very slow to react to new proxies being available (also, tests would take significantly longer).

@fspmarshall fspmarshall requested review from awly and russjones April 17, 2020 23:14
@fspmarshall fspmarshall force-pushed the fspmarshall/iot-debug-master branch from d0b0082 to 0e80232 Compare April 20, 2020 19:13
Copy link
Contributor

@klizhentas klizhentas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fspmarshall approved with minor nitpicks left here and there, please proceed with backport and ping me so I can load test.

@fspmarshall fspmarshall force-pushed the fspmarshall/iot-debug-master branch 3 times, most recently from ecf227c to 5384883 Compare April 22, 2020 22:10
"github.com/gravitational/teleport/lib/utils/workpool"
)

type Lease = workpool.Lease
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use a type alias here?
if it's just to shorten the code below, remove it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Lease type is the only part of the workpool API which isn't wrapped. Alias unifies the API and eliminates the need for consumer to import the workpool package, which is otherwise an implementation detail.

Comment on lines +48 to +56
// SetDefaults set default values for Config.
func (c *Config) SetDefaults() {
if c.ProxyExpiry < 1 {
c.ProxyExpiry = 3 * time.Minute
}
if c.TickRate < 1 {
c.TickRate = 30 * time.Second
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement this as a constructor func NewConfig

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usable zero value with defaults set at time of usage is more consistent with how configuration structs are handled elsewhere in Teleport. Not 100% certain how the convention got started, but it helps with deserialization and with this handy pattern:

thing := NewThing(ThingConfig {
    Foo: "bar",
    // ...
})

Comment on lines +296 to +295
if strings.HasSuffix(name, p.key.Cluster) {
t := strings.TrimSuffix(name, p.key.Cluster)
if strings.HasSuffix(t, ".") {
name = strings.TrimSuffix(t, ".")
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

name = strings.TrimSuffix(name, fmt.Sprintf(".%s", p.key.Cluster))

it will be a noop if there's no such suffix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trimming Cluster and "." separately avoids unnecessary string allocation.

@fspmarshall fspmarshall force-pushed the fspmarshall/iot-debug-master branch from 6d78612 to 188e22c Compare April 23, 2020 19:57
- cache now perforams in-memory fanout of events, eliminating
spurious event generation due to cache init/reset.

- removed old unused logic from reversetunnel agents.

- replaced seekpool with simpler ttl-cache and semaphore-like
lease system.

- add jittered backoff to agent connection attempts to
reduce "thundering herd" effect.

- improved reversetunnel logging.

- improved LB usage in tests.
@fspmarshall fspmarshall force-pushed the fspmarshall/iot-debug-master branch from 188e22c to de246cf Compare April 23, 2020 20:33
@fspmarshall fspmarshall merged commit 4e9eed9 into master Apr 23, 2020
@fspmarshall fspmarshall deleted the fspmarshall/iot-debug-master branch April 23, 2020 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement connection backoff and backpressure for hanging reverse tunnels
4 participants