Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

delta-xds: perform resource state tracking after resource ingestion #23214

Merged

Conversation

adisuissa
Copy link
Contributor

Commit Message: delta-xds: perform resource state tracking after resource ingestion
Additional Description:
Fixes delta-xDS protocol resource state tracking.
Prior to this PR, the implementation updated the state of a resource (tracked version, TTL), even if the fetched config was NACKed. This PR changes it to only update the tracked state after a resource was successfully ingested, and an ACK will be sent.
Note that this PR impacts the TTL of a resource, as now TTL timer starts after the resource was ingested rather than before.

Risk Level: Medium for delta-xDS usage.
Testing: Added unit tests.
Docs Changes: N/A.
Release Notes: Added release note.
Platform Specific Features: N/A.
Runtime guard: Can be reverted by setting envoy.reloadable_features.delta_xds_subscription_state_tracking_fix to false.
Fixes #20790

Signed-off-by: Adi Suissa-Peleg [email protected]

@adisuissa
Copy link
Contributor Author

/assign @htuch
One thing to consider whether TTL bookkeeping (starting a resource TTL timer when it is ingested) should be done before or after the resource was successfully ingested.

The TTL consideration is similar about SotW, and would like your opinion before changing that as well.

@stevenzzzz
Copy link
Contributor

Thanks for working on this.

There is still a risk tho it's much mitigated:

if there is an exception fired in the middle of accepting 1000 resources, say 500 has been updated(v0 --> V1), the 501th RouteConfiguration is rejected.
Now the consumers would have v1 for the first 500 customers, but that will not be reflected in the resourceState dictionary.

The root issue is that the resourceState dict is out of sync with each individual resource's "consumer" (RDSConfigProviderImpl for example). I think instead we could possibly update the per-resource-state by calling from each resource consumer's onConfigUpdateFailed/(add the onConfigUpdateSucceeded).

Which requires quite a bit plumbing.

@adisuissa
Copy link
Contributor Author

adisuissa commented Sep 22, 2022

Thanks for working on this.

There is still a risk tho it's much mitigated:

if there is an exception fired in the middle of accepting 1000 resources, say 500 has been updated(v0 --> V1), the 501th RouteConfiguration is rejected. Now the consumers would have v1 for the first 500 customers, but that will not be reflected in the resourceState dictionary.

The root issue is that the resourceState dict is out of sync with each individual resource's "consumer" (RDSConfigProviderImpl for example). I think instead we could possibly update the per-resource-state by calling from each resource consumer's onConfigUpdateFailed/(add the onConfigUpdateSucceeded).

Which requires quite a bit plumbing.

The right way to fix this is probably to make the update “transactional” or have a per-resource ACK/NACK. Otherwise Envoy may NACK a config, making the server believe that none of the resources are used, but will partially apply the config.

The current fix should address the issue of custom config validators rejecting the config, and the state tracking that is described in #20790.

@yanavlasov
Copy link
Contributor

Ping @htuch for review

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@htuch
Copy link
Member

htuch commented Sep 30, 2022

This makes sense to me. The broader issue that @stevenzzzz points to is not really about TTL tracking but how xDS resource acceptance works in relation to ACK/NACK, so I think let's track that in a separate issue.

@htuch htuch merged commit 6e6a3a1 into envoyproxy:main Sep 30, 2022
mattklein123 added a commit that referenced this pull request Sep 30, 2022
…estion (#23214)"

This reverts commit 6e6a3a1.

Coverage is failing on main.

Signed-off-by: Matt Klein <[email protected]>
RyanTheOptimist pushed a commit that referenced this pull request Oct 1, 2022
#23335)

Revert "delta-xds: perform resource state tracking after resource ingestion (#23214)"

This reverts commit 6e6a3a1.

Coverage is failing on main.

Signed-off-by: Matt Klein <[email protected]>
htuch pushed a commit that referenced this pull request Oct 4, 2022
Fixes delta-xDS protocol resource state tracking.
Prior to this PR, the implementation updated the state of a resource (tracked version, TTL), even if the fetched config was NACKed. This PR changes it to only update the tracked state after a resource was successfully ingested, and an ACK will be sent.
Note that this PR impacts the TTL of a resource, as now TTL timer starts after the resource was ingested rather than before.
This PR is essentially #23214 (which was reverted in #23335) with more coverage.

Risk Level: Medium for delta-xDS usage.
Testing: Added unit tests.
Docs Changes: N/A.
Release Notes: Added release note.
Platform Specific Features: N/A.
Runtime guard: Can be reverted by setting envoy.reloadable_features.delta_xds_subscription_state_tracking_fix to false.
Fixes #20790

Signed-off-by: Adi Suissa-Peleg <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants