
perf: [NPM] [LINUX] add NetPols in background #1969

Merged: 88 commits merged into master from hgregory/05-18-netpol-background on Jul 19, 2023

Conversation

@huntergregory (Contributor) commented on May 18, 2023

Reason for Change:
NPM must use nftables on Ubuntu 22 (k8s 1.25+). NPM syscalls into nftables take much longer than syscalls into legacy iptables. Customers with ~100+ NetPols can see:

  • Application timeouts (due to increased latencies while xtables lock is held).
  • Delay in applying NetPols.


Latency Comparison

Ran 2 experiments for each version. Used b-series burstable VMs.

Cluster has 640 services with 2 Pods each.

| Version | Applying 640 NetPols | Max iptables Restore Latency | List all rules (once at bootup) |
| --- | --- | --- | --- |
| Legacy (Ubuntu 18) | 4 minutes | 0.4 seconds | 0.01 seconds |
| nftables (Ubuntu 22) | 9-22 hours | 3-5 minutes | 5-11 minutes |
| This PR (for nftables) | 13-17 minutes | 3-5 minutes | 5-11 minutes (same codepath) |

Fix: Long Term

nftables do not scale for services · Issue #96018 · kubernetes/kubernetes (github.com)

iptables v1.8.8 has scale improvements. However, AKS is on iptables v1.8.4.

Fix: Near Term (this PR)

Optimizing for ADD NetworkPolicy.

Current Design

ADD NetworkPolicy:

  1. Create all IPSets at once via ipset restore.
  2. Create all iptables rules/chains at once via iptables-restore.
[Diagram: Current Design for Linux NPM]
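As a rough illustration of step 2 above, the restore-based approach shells out to iptables-restore with a generated rule payload. The sketch below is a hedged illustration, not the repository's actual code: the function name and payload handling are assumptions; only the use of iptables-restore and the --noflush flag come from this PR's discussion.

```go
// Illustrative sketch of feeding rendered chains/rules to iptables-restore.
package policies

import (
	"bytes"
	"fmt"
	"os/exec"
)

// runIPTablesRestore applies the rendered chains/rules for one or more NetPols in a
// single call. --noflush keeps existing chains intact; -w waits on the xtables lock.
func runIPTablesRestore(payload string) error {
	cmd := exec.Command("iptables-restore", "--noflush", "-w")
	cmd.Stdin = bytes.NewBufferString(payload)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("iptables-restore failed: %v, output: %s", err, out)
	}
	return nil
}
```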

This PR

Since iptables syscalls take so long, add iptables rules for multiple NetworkPolicies at once.
This change only takes effect if nftables is present on the machine and NPM's ConfigMap has NetPolInBackground=true.
[Diagram: Redesign for Linux NPM]
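A minimal sketch of the batching idea just described: AddPolicy enqueues instead of immediately calling iptables-restore, and a background worker flushes either when the batch is full or on a timer, so many NetPols share one restore call. The threshold of 100 matches the defaultMaxPendingNetPols discussed later in this review; the flush interval and all type/method names here are illustrative assumptions, not the PR's actual implementation (flush would stand in for something like addPoliciesWithRetry).

```go
// Simplified sketch of a background NetPol batcher; not the PR's exact code.
package dataplane

import (
	"sync"
	"time"
)

const (
	maxPendingNetPols = 100                    // mirrors defaultMaxPendingNetPols from the review discussion
	flushInterval     = 500 * time.Millisecond // illustrative; the real interval is configurable
)

type backgroundNetPolAdder struct {
	mu      sync.Mutex
	pending int                // NetPols queued for the next combined iptables-restore
	flush   func(reason string) // e.g. a call into addPoliciesWithRetry
}

// addPolicy queues a NetPol and flushes once the batch is full, instead of
// running one iptables-restore per NetPol.
func (b *backgroundNetPolAdder) addPolicy() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending++
	if b.pending >= maxPendingNetPols {
		b.flush("batch full")
		b.pending = 0
	}
}

// run flushes leftover NetPols periodically so small bursts are not delayed forever.
func (b *backgroundNetPolAdder) run(stop <-chan struct{}) {
	ticker := time.NewTicker(flushInterval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			b.mu.Lock()
			if b.pending > 0 {
				b.flush("interval")
				b.pending = 0
			}
			b.mu.Unlock()
		}
	}
}
```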

Other Changes in this PR

Pipelines

  • Stop running pipelines for NPM v1.
  • Enable Linux Scale test.

Metrics

Adds the following metrics:

  1. npm_linux_iptables_restore_latency_seconds
  2. npm_linux_iptables_delete_latency_seconds


# HELP npm_linux_iptables_delete_latency_seconds Latency in seconds to delete an iptables rule
# TYPE npm_linux_iptables_delete_latency_seconds histogram
npm_linux_iptables_delete_latency_seconds_bucket{le="0.016"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="0.032"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="0.064"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="0.128"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="0.256"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="0.512"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="1.024"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="2.048"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="4.096"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="8.192"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="16.384"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="32.768"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="65.536"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="131.072"} 0
npm_linux_iptables_delete_latency_seconds_bucket{le="+Inf"} 0
npm_linux_iptables_delete_latency_seconds_sum 0
npm_linux_iptables_delete_latency_seconds_count 0
# HELP npm_linux_iptables_restore_latency_seconds Latency in seconds to restore iptables rules by operation label (add/delete NetPol)
# TYPE npm_linux_iptables_restore_latency_seconds histogram
npm_linux_iptables_restore_latency_seconds_bucket{operation="create",le="0.016"} 0
...
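The sample output above shows exponential buckets from 0.016s to 131.072s. Below is a hedged sketch of how such histograms could be registered and observed with prometheus/client_golang; the metric names and help strings come from the output above, while the registration approach, helper functions, and package name are assumptions rather than the PR's actual code.

```go
// Illustrative registration of the two new histograms; not the repository's exact code.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	restoreLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "npm_linux_iptables_restore_latency_seconds",
		Help:    "Latency in seconds to restore iptables rules by operation label (add/delete NetPol)",
		Buckets: prometheus.ExponentialBuckets(0.016, 2, 14), // 0.016s .. 131.072s, as in the output above
	}, []string{"operation"})

	deleteLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "npm_linux_iptables_delete_latency_seconds",
		Help:    "Latency in seconds to delete an iptables rule",
		Buckets: prometheus.ExponentialBuckets(0.016, 2, 14),
	})
)

func init() {
	prometheus.MustRegister(restoreLatency, deleteLatency)
}

// RecordIPTablesRestoreLatency would wrap each iptables-restore call.
func RecordIPTablesRestoreLatency(start time.Time, operation string) {
	restoreLatency.WithLabelValues(operation).Observe(time.Since(start).Seconds())
}

// RecordIPTablesDeleteLatency would wrap each foreground jump-rule delete.
func RecordIPTablesDeleteLatency(start time.Time) {
	deleteLatency.Observe(time.Since(start).Seconds())
}
```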

@huntergregory requested a review from vakalapa on May 18, 2023 19:46
@huntergregory force-pushed the hgregory/05-18-netpol-background branch from 21bda59 to 849b570 on May 18, 2023 19:50
@@ -4,6 +4,8 @@ import "github.com/Azure/azure-container-networking/npm/util"

const (
defaultResyncPeriod = 15
defaultNetPolMaxBatches = 100
Contributor:

Any reason we cannot reuse the defaultApplyMaxBatches and defaultApplyInterval?

Contributor Author (@huntergregory):

Discussed offline: we will have separate threads/config for Windows and Linux.

// 1. Delete jump rules from ingress/egress chains to ingress/egress policy chains.
// We ought to delete these jump rules here in the foreground since if we add an NP back after deleting, iptables-restore --noflush can add duplicate jump rules.
deleteErr := pMgr.deleteOldJumpRulesOnRemove(networkPolicy)
if deleteErr != nil {
Contributor Author (@huntergregory):

Need to ignore errors here.

Contributor Author (@huntergregory):

We actually already ignore doesNotExistErrorCode in case the rule doesn't exist.
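For illustration, a minimal Go sketch of the behavior described in this exchange: treating an iptables "rule does not exist" exit code as success when deleting a jump rule. The exit-code constant, command arguments, and function name are assumptions, not the repository's actual identifiers.

```go
// Sketch of tolerating a missing rule when deleting a jump rule; illustrative only.
package policies

import (
	"fmt"
	"os/exec"
)

const doesNotExistErrorCode = 1 // hypothetical value; legacy iptables returns nonzero when -D finds no match

// deleteJumpRule deletes a jump rule and treats "already gone" as success.
func deleteJumpRule(args ...string) error {
	cmd := exec.Command("iptables", append([]string{"-w", "-D"}, args...)...)
	if err := cmd.Run(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok && exitErr.ExitCode() == doesNotExistErrorCode {
			// Rule was already absent; nothing to clean up.
			return nil
		}
		return fmt.Errorf("failed to delete jump rule: %w", err)
	}
	return nil
}
```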

@huntergregory (Contributor Author):

TODO: only apply netpol in background if nf-tables is present

@huntergregory changed the title from "perf: [NPM] [LINUX] apply dirty NetPols in background" to "perf: [NPM] [LINUX] add NetPols in background" on Jun 16, 2023
@@ -7,6 +7,8 @@ const (
defaultApplyMaxBatches = 100
defaultApplyInterval = 500
defaultMaxBatchedACLsPerPod = 30
defaultMaxPendingNetPols = 100
Contributor:

Isn't 100 too much? Can we have something like 10?

@huntergregory (Contributor Author) commented on Jun 22, 2023:

Tested with MaxPendingNetPols=10. This doesn't seem to give us much of a performance boost: it took 12 hours to add 640 NetPols. With MaxPendingNetPols=100, the same takes <20 minutes (the table in the PR description has more details).

Contributor Author (@huntergregory):

With MaxPendingNetPols=10, I also saw 1/10 NPM Pods crash with this error after adding 625/640 NetPols, 11.5 hours after NPM came online:

I0622 05:55:28.926162       1 trace.go:219] Trace[1241999867]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169 (22-Jun-2023 05:54:24.710) (total time: 64119ms):
Trace[1241999867]: ---"Objects listed" error:<nil> 64118ms (05:55:28.828)
Trace[1241999867]: [1m4.119544167s] [1m4.119544167s] END
I0622 05:55:29.727548       1 trace.go:219] Trace[257377008]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169 (22-Jun-2023 05:54:24.512) (total time: 65114ms):
Trace[257377008]: ---"Objects listed" error:<nil> 65018ms (05:55:29.531)
Trace[257377008]: [1m5.114375378s] [1m5.114375378s] END


// Prevent netpol in background unless we're in Linux and using nftables.
// This step must be performed after bootupDataplane() because it calls util.DetectIptablesVersion(), which sets the proper value for util.Iptables
dp.netPolInBackground = cfg.NetPolInBackground && !util.IsWindowsDP() && (strings.Contains(util.Iptables, "nft") || dp.debug)
Contributor:

Are we saying we will only make this change for nft? Why not for legacy too?

Contributor Author (@huntergregory):

Discussed offline: we will only do this for nft. We will keep the codepath the same for legacy, which doesn't need the performance boost.

// The caller must lock netPolQueue.
func (dp *DataPlane) addPoliciesWithRetry(context string) {
netPols := dp.netPolQueue.dump()
klog.Infof("[DataPlane] adding policies %+v", netPols)
Contributor:

Can we make this a debug statement?

Contributor Author (@huntergregory):

It seems like we don't have debug logging capability via klog.

// 2. get Endpoints
var err error
var endpointList map[string]string
if !dp.inBootupPhase() {
Contributor:

Is this condition correct? The only condition for wasInBootPhase to be true is if dp.inBootupPhase() returns true.

Here you are checking the opposite along with wasInBootPhase; I think we will never hit this condition, unless I am missing something?

@huntergregory (Contributor Author) commented on Jun 21, 2023:

The idea is to protect against a race:

  1. DP in bootup phase.
  2. dp.incrementBatchAndApplyIfNeeded() is called in the last iteration of the above loop.
  3. dp.FinishBootupPhase() runs in another thread.

A few more details are in the code comment below; a rough illustration of the guard follows.
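The self-contained Go illustration below restates that race with simplified stand-in types; only inBootupPhase, FinishBootupPhase, and incrementBatchAndApplyIfNeeded come from the discussion above, and the field names and endpoint-handling details are assumptions.

```go
// Simplified stand-in for the bootup-phase race guard; not the PR's actual types.
package dataplane

import "sync"

type DataPlane struct {
	mu       sync.Mutex
	inBootup bool
}

func (dp *DataPlane) inBootupPhase() bool {
	dp.mu.Lock()
	defer dp.mu.Unlock()
	return dp.inBootup
}

// FinishBootupPhase may run on another goroutine at any point.
func (dp *DataPlane) FinishBootupPhase() {
	dp.mu.Lock()
	defer dp.mu.Unlock()
	dp.inBootup = false
}

func (dp *DataPlane) addPolicies(netPols []string) {
	wasInBootupPhase := dp.inBootupPhase()
	for range netPols {
		// dp.incrementBatchAndApplyIfNeeded() would run here, and
		// FinishBootupPhase() could fire concurrently from another goroutine.
	}
	// At first glance this looks unreachable (wasInBootupPhase is only true when
	// inBootupPhase() was true), but it fires exactly when bootup ended while the
	// loop above was running; endpoints must then be fetched the normal way.
	if wasInBootupPhase && !dp.inBootupPhase() {
		// re-fetch endpoints via the non-bootup path
	}
}
```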

@huntergregory requested a review from a team as a code owner on June 23, 2023 18:52
@huntergregory force-pushed the hgregory/05-18-netpol-background branch from 3503c1c to 6ad7639 on June 23, 2023 18:57
// addPoliciesWithRetry tries adding all policies. If this fails, it tries adding policies one by one.
// The caller must lock netPolQueue.
func (dp *DataPlane) addPoliciesWithRetry(context string) {
netPols := dp.netPolQueue.dump()
Member:

We're storing the background queue of NetPols as a map, but when applying we convert from the map back to a slice. Any reason not to store the background queue as a slice and avoid this conversion?

Contributor Author (@huntergregory):

Having the map helps with dp.RemovePolicy(), where we remove the NetPol from the queue.
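A small Go sketch of the trade-off in this exchange, assuming an illustrative netPolQueue shape: keying the queue by NetPol key makes removal O(1) for the RemovePolicy path, while dump() pays a map-to-slice conversion once per batch. Field, type, and method names here are simplified, not the repository's exact code.

```go
// Illustrative map-backed pending-NetPol queue; not the PR's actual implementation.
package dataplane

import "sync"

type NPMNetworkPolicy struct {
	PolicyKey string // e.g. "namespace/name"
}

type netPolQueue struct {
	mu      sync.Mutex
	pending map[string]*NPMNetworkPolicy // keyed by NetPol key
}

// enqueue records a NetPol to be added in the next background batch.
func (q *netPolQueue) enqueue(p *NPMNetworkPolicy) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending[p.PolicyKey] = p
}

// delete drops a pending NetPol; this is the RemovePolicy path that motivates the map.
func (q *netPolQueue) delete(policyKey string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	delete(q.pending, policyKey)
}

// dump converts the map to a slice for a batched apply such as addPoliciesWithRetry.
func (q *netPolQueue) dump() []*NPMNetworkPolicy {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := make([]*NPMNetworkPolicy, 0, len(q.pending))
	for _, p := range q.pending {
		out = append(out, p)
	}
	return out
}
```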

@huntergregory (Contributor Author):

/azp run

@azure-pipelines:

No commit pushedDate could be found for PR 1969 in repo Azure/azure-container-networking

@huntergregory (Contributor Author):

/azp run

@azure-pipelines:

Azure Pipelines successfully started running 4 pipeline(s), but failed to run 1 pipeline(s).

@huntergregory (Contributor Author):

/azp run

@azure-pipelines:

Azure Pipelines successfully started running 4 pipeline(s), but failed to run 1 pipeline(s).

@huntergregory (Contributor Author):

/azp run

@azure-pipelines:

Azure Pipelines successfully started running 4 pipeline(s), but failed to run 1 pipeline(s).

@vakalapa merged commit ebddca1 into master on Jul 19, 2023
@vakalapa deleted the hgregory/05-18-netpol-background branch on July 19, 2023 16:13
@xanecs commented on Jul 27, 2023:

@huntergregory why not update the iptables binary that is used in the docker image?

@huntergregory (Contributor Author):

> @huntergregory why not update the iptables binary that is used in the docker image?

Hi @xanecs, the limitation seems to be in the iptables version installed on the AKS node.

Labels: linux, npm (Related to NPM)