
Implement prioritized certificate fetching #391

Merged: 4 commits into istio:master, Feb 16, 2023
Conversation

@qfel (Contributor) commented on Feb 13, 2023

Provides the interface requested in #334.

This allows every certificate request to be assigned a priority. The SecretManager will issue a configurable number of concurrent requests, ordered by priority and then request time (priority always wins). In particular, a high-priority call inserted later can jump ahead of lower-priority calls already in the queue.

Certificate refreshes are queued at the lowest (Background) priority.
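
To make the ordering concrete, here is a minimal sketch; the names Priority and PendingRequest, and the exact priority levels, are illustrative rather than the PR's actual identifiers:

```rust
use std::cmp::Ordering;
use std::time::Instant;

// Illustrative priority levels; refreshes would use the lowest one.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Priority {
    Background, // certificate refreshes
    Warmup,
    RealTime,
}

// A queued request: priority always wins, ties are broken by request time.
struct PendingRequest {
    priority: Priority,
    requested_at: Instant,
}

impl PendingRequest {
    // True if `self` should be dequeued before `other`.
    fn before(&self, other: &Self) -> bool {
        match self.priority.cmp(&other.priority) {
            Ordering::Equal => self.requested_at < other.requested_at,
            ord => ord == Ordering::Greater, // higher priority first
        }
    }
}
```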

The new functionality is not used yet - it is just available as a new method in SecretManager (and its implementation is used to back the existing fetch_certificates call). For code simplicity, SecretManager was changed to have no parallelism (while still allowing configurable concurrency). The code isn't computationally heavy enough to require multiple cores, and most of the time should be spent waiting on certificate fetches. This in particular lets us reduce synchronization and data-ownership issues.
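
As a rough illustration of that single-task design, here is a sketch under assumed names (MAX_CONCURRENT_FETCHES, Request, and fetch_certificate are hypothetical; it uses the futures crate's FuturesUnordered to track in-flight fetches):

```rust
use futures::stream::{FuturesUnordered, StreamExt};
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use tokio::sync::mpsc;

const MAX_CONCURRENT_FETCHES: usize = 8; // hypothetical limit

// Max-heap key: higher priority first; among equals, earlier sequence first.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct Request(u8, Reverse<u64>);

async fn fetch_certificate(_req: Request) {
    // The actual CA call would go here.
}

async fn run(mut requests: mpsc::Receiver<Request>) {
    let mut queue = BinaryHeap::new();
    let mut workers = FuturesUnordered::new();
    loop {
        tokio::select! {
            Some(req) = requests.recv() => queue.push(req),
            Some(()) = workers.next() => {} // a fetch finished
            else => break, // channel closed and no fetches in flight
        }
        // Top up to the concurrency limit from the head of the queue.
        while workers.len() < MAX_CONCURRENT_FETCHES {
            match queue.pop() {
                Some(req) => workers.push(fetch_certificate(req)),
                None => break,
            }
        }
    }
}
```

A single task owns both the queue and the in-flight set, so no locks are needed around them; concurrency comes from the multiple futures polled inside that one task.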

Unit tests proved tricky - they rely on tokio's time control facilities. Those are pretty hard to get right and pretty hard to debug. Ideally tokio would evolve to at least make this type of testing more debuggable (e.g. panic on auto-advance for specific blocks of code). We may decide maintaining those tests is too much of a burden, but at least for the moment they let us test the code without having to add test-only instrumentation.
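
For readers unfamiliar with those facilities, here is a minimal example of tokio's paused-clock testing (not a test from this PR; it uses the futures crate's poll! macro):

```rust
use std::time::Duration;

#[tokio::test(start_paused = true)]
async fn deadline_fires_only_after_advance() {
    // With the clock paused, timers resolve only when time is advanced
    // manually (or auto-advanced when the runtime would otherwise idle).
    let deadline = tokio::time::sleep(Duration::from_secs(3600));
    tokio::pin!(deadline);

    // One second short of the deadline: still pending.
    tokio::time::advance(Duration::from_secs(3599)).await;
    assert!(futures::poll!(deadline.as_mut()).is_pending());

    // Crossing the deadline wakes the timer.
    tokio::time::advance(Duration::from_secs(1)).await;
    assert!(futures::poll!(deadline.as_mut()).is_ready());
}
```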

istio-testing added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Feb 13, 2023
@briansonnenberg (Contributor) left a comment

Looks great overall. Thank you for cleaning up my novice Rust code as well. :)

```rust
        None => break 'main,
    },
    Some(res) = workers.next() => match res {
        Err(_) => break 'main,
```
briansonnenberg (Contributor) commented:

Does breaking out of 'main here mean that if a particular worker request fails, even for a non-critical reason, the client will no longer process requests? Would it make sense to avoid that with some kind of backoff/retry or error handling?

@qfel (Author) replied:

Workers are refresh(...) calls, which can only fail with watch::error::SendError, and that cannot really happen because the recv end of the channel is kept alive via self. I figured there was no reason to panic.

Though I just realized that the other break cannot happen either, because the send end of requests is also referenced via self. So this code leaks the background worker, which AFAIR is not a regression; if you want, I can submit as-is and fix it later.
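
For context, tokio's watch::Sender::send fails (with watch::error::SendError) only once every receiver has been dropped; a minimal demonstration of that failure mode:

```rust
use tokio::sync::watch;

fn main() {
    let (tx, rx) = watch::channel(0u32);
    assert!(tx.send(1).is_ok());  // a receiver is alive: send succeeds
    drop(rx);                     // drop the last receiver
    assert!(tx.send(2).is_err()); // now send fails with SendError
}
```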

@qfel (Author) left a comment

Cannot reproduce test failures with cargo clippy --benches --tests --bins. What am I missing?

BTW, you probably want to add the --no-deps flag in the test.


```rust
#[derive(Clone)]
pub struct CaClient {
    cfg: ClientConfig,
    state: Arc<RwLock<ClientState>>,
```
Member commented:

What's the benefit of having a state struct holding the fetches field instead of defining it here with type Arc<RwLock<Vec<Identity>>>?

@qfel (Author) replied:

At some point I was thinking of putting more fields there, but then abandoned the idea. I can replace ClientState with a plain Vec if anybody cares.

Member replied:

I don't mind it like this, plus it's mock code, so I'm not too concerned.
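
For concreteness, the two shapes being compared look roughly like this (a sketch with placeholder types; the tokio RwLock flavor is an assumption):

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

struct Identity;     // placeholder for the real type
struct ClientConfig; // placeholder for the real type

// As written in the PR: a named state struct, leaving room for more fields.
struct ClientState {
    fetches: Vec<Identity>,
}

struct CaClient {
    cfg: ClientConfig,
    state: Arc<RwLock<ClientState>>,
}

// The flattened alternative from the question:
// state: Arc<RwLock<Vec<Identity>>>
```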

```rust
impl Converter {
    pub fn new() -> Self {
        Self {
            sys_now: SystemTime::now(),
```
Member commented:

Isn't this pinning the now time to the moment the struct instance was created? Won't any function call (even a delayed one) refer to that time instead of the current time?

@qfel (Author) replied:

The now and sys_now values can be arbitrary; they only have to represent the ~same moment in time so that we can convert. One of them is always subtracted and the other added, so they cancel each other out.
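
A sketch of that cancellation argument (field names follow the snippet above; the conversion methods are illustrative, not the PR's code):

```rust
use std::time::{Duration, Instant, SystemTime};

struct Converter {
    now: Instant,
    sys_now: SystemTime,
}

impl Converter {
    fn new() -> Self {
        // Captured once, at (approximately) the same moment.
        Self { now: Instant::now(), sys_now: SystemTime::now() }
    }

    // t = now + d  maps to  sys_now + d: the capture time cancels out.
    fn instant_to_system(&self, t: Instant) -> SystemTime {
        self.sys_now + t.saturating_duration_since(self.now)
    }

    // sys = sys_now + d  maps to  now + d (times before sys_now clamp to now).
    fn system_to_instant(&self, t: SystemTime) -> Instant {
        self.now + t.duration_since(self.sys_now).unwrap_or(Duration::ZERO)
    }
}
```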


This uses a single task for all certificates for simplicity. As the actual work should be mostly IO-bound, I don't expect the lack of true parallelism to be an issue.
@qfel (Author) commented on Feb 15, 2023

/retest

Fixed background worker termination, cleaned up code a bit (mostly naming, very little structure changed).

@qfel (Author) commented on Feb 16, 2023

/retest

@qfel (Author) commented on Feb 16, 2023

/retest

@qfel (Author) commented on Feb 16, 2023

/retest

istio-testing merged commit a849147 into istio:master on Feb 16, 2023