
NETOBSERV-613: drop messages when they accumulate in the exporter #63

Merged
merged 9 commits into netobserv:main from capacity-limiter on Oct 19, 2022

Conversation


@mariomac commented Oct 13, 2022

The following PRs need to be merged first: #61 and #62. Then, this PR needs to be rebased onto main.

This PR adds a flow dropper before the exporter, to prevent memory from growing indefinitely and to keep synchronized channel operations from blocking earlier stages of the pipeline.

It also logs the number of dropped flows, suggesting that the customer take corrective action.

The resulting architecture is:

```mermaid
flowchart TD

    E(ebpf.FlowFetcher) --> |"pushes via<br/>RingBuffer"| RB(flow.RingBufTracer)
    E <--> |"polls<br/>PerCPUHashMap"| M(flow.MapTracer)
    RB --> |chan *flow.Record| ACC(flow.Accounter)

    ACC --> |"chan []*flow.Record"| DD(flow.Deduper)
    M --> |"chan []*flow.Record"| DD

    subgraph Optional
        DD
    end

    DD --> |"chan []*flow.Record"| CL(flow.CapacityLimiter)

    CL --> |"chan []*flow.Record"| EX("export.GRPCProto<br/>or<br/>export.KafkaProto")
```
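For illustration, here is a minimal sketch of what such a capacity-limiting stage can look like, assuming a placeholder `Record` type and plain standard-library logging; the actual `flow.CapacityLimiter` added by this PR may differ in buffer handling, log wording and timing.

```go
// Hypothetical sketch of a capacity-limiting stage; NOT the PR's exact code.
// It forwards batches while the downstream channel has room, otherwise it
// drops them and periodically warns about how many flows were discarded.
package flow

import (
	"log"
	"time"
)

// Record stands in for the agent's flow record type.
type Record struct{}

type CapacityLimiter struct {
	droppedFlows int
}

// Limit never blocks on the output channel: if the exporter falls behind and
// the channel fills up, whole batches are dropped instead of letting memory
// grow or stalling the earlier pipeline stages.
func (c *CapacityLimiter) Limit(in <-chan []*Record, out chan<- []*Record) {
	logTicker := time.NewTicker(30 * time.Second) // illustrative period
	defer logTicker.Stop()
	for {
		select {
		case batch, ok := <-in:
			if !ok {
				return
			}
			select {
			case out <- batch: // downstream has room: forward the batch
			default: // exporter is not keeping up: drop and count
				c.droppedFlows += len(batch)
			}
		case <-logTicker.C:
			if c.droppedFlows > 0 {
				log.Printf("%d flows were dropped during the last 30s because the agent is forwarding flows faster than the exporter can process them",
					c.droppedFlows)
				c.droppedFlows = 0
			}
		}
	}
}
```

Dropping whole batches via a non-blocking send is what keeps the accounter, tracers and optional deduper from ever stalling on a slow exporter.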

@mariomac added the ok-to-test label (To set manually when a PR is safe to test. Triggers image build on PR.) Oct 13, 2022
@github-actions

New image: ["quay.io/netobserv/netobserv-ebpf-agent:9c12b1d"]. It will expire after two weeks.

df := c.droppedFlows
if df > 0 {
	c.droppedFlows = 0
	cllog.Warnf("%d flows were dropped during the last %s because the agent is forwarding "+
Member

Maybe it's time to have metrics rather than logs for this kind of thing? Or create a follow-up?

Author

I agree, though we can also keep this message as a first-hand notification that could be useful for the user. I'd create a JIRA issue to implement and expose metrics.
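As a rough illustration of that follow-up, a dropped-flows counter could be exposed with the Prometheus Go client along these lines; the metric name and wiring below are assumptions made for the sketch, not part of this PR.

```go
// Hypothetical sketch: counting dropped flows as a Prometheus metric instead
// of (or in addition to) the warning log. Not part of this PR.
package flow

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// droppedFlowsCounter uses an illustrative metric name, not an agreed one.
var droppedFlowsCounter = promauto.NewCounter(prometheus.CounterOpts{
	Name: "netobserv_agent_dropped_flows_total",
	Help: "Flows dropped because the exporter could not keep up",
})

// recordDrop would be called from the limiter whenever a batch is discarded.
func recordDrop(batchLen int) {
	droppedFlowsCounter.Add(float64(batchLen))
}
```

The agent would also have to expose an HTTP endpoint (e.g. with promhttp) for those metrics to be scraped.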

	droppedFlows int
}

func (c *CapacityLimiter) Limit(in <-chan []*Record, out chan<- []*Record) {
@jotak (Member) Oct 18, 2022

I'm not sure it's worth it to have a whole stage with in/out buffers for just this simple check, no? I tried benchmarking against another version with the limiting done at the source and got a slightly better time: 15016 ns/op versus 21698 ns/op with this limiter stage (same number of allocs).

Author

Yeah, I just wanted to separate each functionality as much as possible to avoid overloading some pipeline stages too much. Also, otherwise we would have to implement this in multiple places: since the deduper is optional, the check would have to go both in the FlowAccounter and the MapTracer.

Regarding the performance difference, I think it is negligible in this case, as the pipeline processes flows in batches and each batch is usually forwarded once every CacheActiveTimeout period (5s by default).

Member

Fair enough, after our discussion I understand it better.
Anyway, we can still revisit this later if we find there are optimizations to bring here.
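For reference, the kind of micro-benchmark mentioned above could be structured roughly as follows; the stage body and batch size are illustrative assumptions, not the code that was actually measured.

```go
// Hypothetical micro-benchmark of a dedicated limiter stage; illustrative only.
package flow_test

import "testing"

type record struct{}

// limitStage forwards batches, dropping them when the output channel is full.
func limitStage(in <-chan []*record, out chan<- []*record) {
	for batch := range in {
		select {
		case out <- batch:
		default: // drop
		}
	}
	close(out)
}

func BenchmarkLimiterStage(b *testing.B) {
	in := make(chan []*record, 64)
	out := make(chan []*record, 64)
	done := make(chan struct{})
	go limitStage(in, out)
	go func() { // drain the exporter side
		for range out {
		}
		close(done)
	}()
	batch := make([]*record, 100) // arbitrary batch size
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		in <- batch
	}
	close(in)
	<-done
}
```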

@mariomac changed the base branch from tmp-613-2 to main October 18, 2022 10:35
@github-actions bot removed the ok-to-test label Oct 18, 2022
@codecov-commenter

Codecov Report

Merging #63 (1f06575) into main (f63d104) will increase coverage by 0.24%.
The diff coverage is 37.77%.

@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
+ Coverage   33.04%   33.28%   +0.24%     
==========================================
  Files          21       22       +1     
  Lines        1392     1433      +41     
==========================================
+ Hits          460      477      +17     
- Misses        913      936      +23     
- Partials       19       20       +1     
Flag        Coverage Δ
unittests   33.28% <37.77%> (+0.24%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files        Coverage Δ
pkg/agent/agent.go    14.22% <0.00%> (-0.55%) ⬇️
pkg/flow/limiter.go   51.51% <51.51%> (ø)


@jotak (Member) commented Oct 18, 2022

/lgtm

@jotak merged commit 689b24d into netobserv:main Oct 19, 2022
@mariomac deleted the capacity-limiter branch October 26, 2022 13:17