
NETOBSERV-613: drop messages when they accumulate in the exporter #63

Merged
merged 9 commits into netobserv:main from capacity-limiter on Oct 19, 2022

Conversation


@mariomac commented Oct 13, 2022

The following PRs need to be merged first: #61 and #62. Then, this PR needs to be rebased onto main.

This PR adds a flow dropper before the exporter, to prevent memory from growing indefinitely and to keep synchronized channel operations from blocking earlier stages of the pipeline.

It also logs the number of dropped flows, suggesting that the customer take corrective action.

The resulting architecture is:

```mermaid
flowchart TD

    E(ebpf.FlowFetcher) --> |"pushes via<br/>RingBuffer"| RB(flow.RingBufTracer)
    E <--> |"polls<br/>PerCPUHashMap"| M(flow.MapTracer)
    RB --> |chan *flow.Record| ACC(flow.Accounter)

    ACC --> |"chan []*flow.Record"| DD(flow.Deduper)
    M --> |"chan []*flow.Record"| DD

    subgraph Optional
        DD
    end

    DD --> |"chan []*flow.Record"| CL(flow.CapacityLimiter)

    CL --> |"chan []*flow.Record"| EX("export.GRPCProto<br/>or<br/>export.KafkaProto")
```
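For illustration, here is a minimal sketch of what such a capacity-limiting stage can look like, assuming a placeholder `Record` type and plain standard-library logging; the actual `flow.CapacityLimiter` added by this PR may differ in buffer handling, log wording and timing.

```go
// Hypothetical sketch of a capacity-limiting stage; NOT the PR's exact code.
// It forwards batches while the downstream channel has room, otherwise it
// drops them and periodically warns about how many flows were discarded.
package flow

import (
	"log"
	"time"
)

// Record stands in for the agent's flow record type.
type Record struct{}

type CapacityLimiter struct {
	droppedFlows int
}

// Limit never blocks on the output channel: if the exporter falls behind and
// the channel fills up, whole batches are dropped instead of letting memory
// grow or stalling the earlier pipeline stages.
func (c *CapacityLimiter) Limit(in <-chan []*Record, out chan<- []*Record) {
	logTicker := time.NewTicker(30 * time.Second) // illustrative period
	defer logTicker.Stop()
	for {
		select {
		case batch, ok := <-in:
			if !ok {
				return
			}
			select {
			case out <- batch: // downstream has room: forward the batch
			default: // exporter is not keeping up: drop and count
				c.droppedFlows += len(batch)
			}
		case <-logTicker.C:
			if c.droppedFlows > 0 {
				log.Printf("%d flows were dropped during the last 30s because the agent is forwarding flows faster than the exporter can process them",
					c.droppedFlows)
				c.droppedFlows = 0
			}
		}
	}
}
```

Dropping whole batches via a non-blocking send is what keeps the accounter, tracers and optional deduper from ever stalling on a slow exporter.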

@mariomac added the ok-to-test label (To set manually when a PR is safe to test. Triggers image build on PR.) Oct 13, 2022
@github-actions

New image: ["quay.io/netobserv/netobserv-ebpf-agent:9c12b1d"]. It will expire after two weeks.

df := c.droppedFlows
if df > 0 {
	c.droppedFlows = 0
	cllog.Warnf("%d flows were dropped during the last %s because the agent is forwarding "+
Member

Maybe it's time to have metrics rather than logs for this kind of thing? Or create a follow-up?

Author

I agree, though we can also keep this message as a first-hand notification that could be useful for the user. I'd create a JIRA issue to implement and expose metrics.
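As a rough illustration of that follow-up, a dropped-flows counter could be exposed with the Prometheus Go client along these lines; the metric name and wiring below are assumptions made for the sketch, not part of this PR.

```go
// Hypothetical sketch: counting dropped flows as a Prometheus metric instead
// of (or in addition to) the warning log. Not part of this PR.
package flow

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// droppedFlowsCounter uses an illustrative metric name, not an agreed one.
var droppedFlowsCounter = promauto.NewCounter(prometheus.CounterOpts{
	Name: "netobserv_agent_dropped_flows_total",
	Help: "Flows dropped because the exporter could not keep up",
})

// recordDrop would be called from the limiter whenever a batch is discarded.
func recordDrop(batchLen int) {
	droppedFlowsCounter.Add(float64(batchLen))
}
```

The agent would also have to expose an HTTP endpoint (e.g. with promhttp) for those metrics to be scraped.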

	droppedFlows int
}

func (c *CapacityLimiter) Limit(in <-chan []*Record, out chan<- []*Record) {
@jotak (Member) Oct 18, 2022

I'm not sure it's worth it to have a whole stage with in/out buffers for just this simple check, no? I tried benchmarking against another version with the limiting done at the source and got a slightly better time: 15016 ns/op versus 21698 ns/op with this limiter stage (same number of allocs).

Author

Yeah, I just wanted to separate each functionality as much as possible to avoid overloading some pipeline stages too much. Also, otherwise we would have to implement this in multiple places: since the deduper is optional, the check would have to go both in the FlowAccounter and the MapTracer.

Regarding the performance difference, I think it is negligible in this case, as the pipeline processes flows in batches and each batch is usually forwarded once every CacheActiveTimeout period (5s by default).

Member

Fair enough, after our discussion I understand it better.
Anyway, we can still revisit this later if we find there are optimizations to bring here.
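For reference, the kind of micro-benchmark mentioned above could be structured roughly as follows; the stage body and batch size are illustrative assumptions, not the code that was actually measured.

```go
// Hypothetical micro-benchmark of a dedicated limiter stage; illustrative only.
package flow_test

import "testing"

type record struct{}

// limitStage forwards batches, dropping them when the output channel is full.
func limitStage(in <-chan []*record, out chan<- []*record) {
	for batch := range in {
		select {
		case out <- batch:
		default: // drop
		}
	}
	close(out)
}

func BenchmarkLimiterStage(b *testing.B) {
	in := make(chan []*record, 64)
	out := make(chan []*record, 64)
	done := make(chan struct{})
	go limitStage(in, out)
	go func() { // drain the exporter side
		for range out {
		}
		close(done)
	}()
	batch := make([]*record, 100) // arbitrary batch size
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		in <- batch
	}
	close(in)
	<-done
}
```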

@mariomac changed the base branch from tmp-613-2 to main October 18, 2022 10:35
@github-actions bot removed the ok-to-test label Oct 18, 2022
@codecov-commenter

Codecov Report

Merging #63 (1f06575) into main (f63d104) will increase coverage by 0.24%.
The diff coverage is 37.77%.

@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
+ Coverage   33.04%   33.28%   +0.24%     
==========================================
  Files          21       22       +1     
  Lines        1392     1433      +41     
==========================================
+ Hits          460      477      +17     
- Misses        913      936      +23     
- Partials       19       20       +1     
Flag        Coverage Δ
unittests   33.28% <37.77%> (+0.24%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files        Coverage Δ
pkg/agent/agent.go    14.22% <0.00%> (-0.55%) ⬇️
pkg/flow/limiter.go   51.51% <51.51%> (ø)


@jotak (Member) commented Oct 18, 2022

/lgtm

@jotak merged commit 689b24d into netobserv:main Oct 19, 2022
@mariomac deleted the capacity-limiter branch October 26, 2022 13:17