NETOBSERV-240 Topology metric selection #111
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
08ce9d4 to 5d93483
/retest
return &TopologyQueryBuilder{
	FlowQueryBuilder: NewFlowQueryBuilder(cfg, start, end, limit, reporter),
	topology: &Topology{
		timeRange: timeRange,
I think we are not using the timeRange properly here, which leads to very strange values.
Here is an example:
If I select AVG/packets/last hour and look at a communication between two specific nodes, I get 105.26 packets/min, but if I select SUM/packets/last hour I get 2000, while I would expect something around 6000.
Interesting. On my side I get the correct numbers, like 7908/min for 482000 in total.
The query is built the following way:
<url path>?query=
topk(
<k>,
sum by(<aggregations>) (
<function>(
{<label filters>}|<line filters>|json|<json filters>
|unwrap Bytes|__error__=""[<time>s]
)
)
)
&<query params>&step=60s
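As a rough illustration of the template above, here is a minimal Go sketch that assembles the same query shape. It is not the plugin's actual builder: the function name buildTopologyQuery, its parameters, the app="example" label and the SrcK8S_Name/DstK8S_Name aggregation fields are all assumptions made for the example, and the optional line and JSON filters are left out for brevity.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildTopologyQuery is a hypothetical helper (not the plugin's code) that
// fills in the template: topk(<k>, sum by(<aggregations>) (<function>(...))).
// Line filters and JSON filters from the template are omitted here.
func buildTopologyQuery(k int, aggregations []string, function, labelFilters, unwrapField string, rangeSeconds int) string {
	return fmt.Sprintf(`topk(%d,sum by(%s) (%s({%s}|json|unwrap %s|__error__=""[%ds])))`,
		k, strings.Join(aggregations, ","), function, labelFilters, unwrapField, rangeSeconds)
}

func main() {
	// Label and field names below are illustrative assumptions.
	q := buildTopologyQuery(50, []string{"SrcK8S_Name", "DstK8S_Name"},
		"sum_over_time", `app="example"`, "Bytes", 300)
	fmt.Println(q)

	// The query is URL-encoded and sent together with the forced 60s step.
	params := url.Values{}
	params.Set("query", q)
	params.Set("step", "60s")
	fmt.Println("/loki/api/v1/query_range?" + params.Encode())
}
```

With step=60s, Loki evaluates the inner expression once per minute, so the response is a matrix with at most one sample per minute for each aggregated series.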
So you will get a set of values (one per minute) of average or sum in your example. Maybe network sampling makes the avg function wrong, since it aggregates flows at a single time 🤔
Also we can play with the step:
//TODO: check if step should be configurable. 60s is forced to help calculations on front end side
sb.WriteString("&step=60s")
As seen together, this appears when netflows are not communicating during a 60s period of time (step). In that case Loki doesn't return any value for this time range and the avg / rate total calculation is wrong.
I have pushed a fix. Thanks @OlivierCazade 🥇
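To make the failure mode concrete, here is a small Go sketch contrasting an average over only the returned samples with an average over every step of the requested range. It is not the fix that was pushed; the function names and the sample data are hypothetical. Note that the figures reported earlier (2000 total, 105.26/min) are consistent with about 19 non-empty steps out of 60, since 2000 / 105.26 ≈ 19.

```go
package main

import "fmt"

// naiveAverage divides by the number of samples Loki returned, so steps
// without any flow are silently ignored and the average is overestimated.
func naiveAverage(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range values {
		sum += v
	}
	return sum / float64(len(values))
}

// averagePerStep (hypothetical name) divides by the number of steps expected
// in the whole range, which treats missing steps as zero traffic.
func averagePerStep(values []float64, rangeSeconds, stepSeconds int) float64 {
	if rangeSeconds <= 0 || stepSeconds <= 0 {
		return 0
	}
	expected := rangeSeconds / stepSeconds
	if expected == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range values {
		sum += v
	}
	return sum / float64(expected)
}

func main() {
	// Made-up data: traffic seen in only 20 of the 60 one-minute steps.
	values := make([]float64, 20)
	for i := range values {
		values[i] = 100
	}
	fmt.Printf("naive: %.2f/min\n", naiveAverage(values))
	fmt.Printf("range-aware: %.2f/min\n", averagePerStep(values, 3600, 60))
}
```

The range-aware figure multiplied by the number of steps gives back the total, which is the consistency the reviewer expected between AVG and SUM.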
I saw that you pushed some commits while I was reviewing, so I tried again and I confirm that I can reproduce it each time.
Strange, this was pushed yesterday 🤔
@OlivierCazade I have created a bug task for the crash. Updating the lib doesn't fix this and it may impact other pending PRs. I prefer to fix this separately.
5777538 to c90a89c
I confirm that the bug is on the PF lib side and should not block this PR.
The "Rate" metric is not clear to me. Does it mean the percentage of traffic from the total? If so, maybe we should find another name: fraction, percentage (from total), ...
Yes, I agree. OK for the Percentage naming. Maybe we should do a
It's ok. It was just a naming confusion on my side.
For what it's worth, in Kiali it's called "Traffic Distribution": https://kiali.io/docs/tutorials/travels/04-observe/#graph-walkthrough (i.e. from a given graph node, the percentage of outbound traffic to a destination from total outbound traffic). But if I understand correctly, it's not the same thing here.
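For readers following the naming discussion, this is a minimal Go sketch of the "percentage from total" reading proposed above: each edge's share of the traffic summed over all edges. It only illustrates that interpretation; the function name percentOfTotal and the edge keys are hypothetical, and nothing here is taken from the plugin's code. Kiali's "Traffic Distribution" differs in that it divides by the total outbound traffic of a single source node.

```go
package main

import "fmt"

// percentOfTotal (hypothetical helper) returns, for each edge, its share of
// the traffic summed over all edges, expressed as a percentage.
func percentOfTotal(edgeTotals map[string]float64) map[string]float64 {
	total := 0.0
	for _, v := range edgeTotals {
		total += v
	}
	result := make(map[string]float64, len(edgeTotals))
	for edge, v := range edgeTotals {
		if total == 0 {
			// Avoid dividing by zero when no traffic was observed.
			result[edge] = 0
			continue
		}
		result[edge] = v / total * 100
	}
	return result
}

func main() {
	// Made-up byte totals for two edges.
	shares := percentOfTotal(map[string]float64{"a->b": 300, "a->c": 100})
	fmt.Printf("a->b: %.0f%%\n", shares["a->b"])
	fmt.Printf("a->c: %.0f%%\n", shares["a->c"])
}
```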
c90a89c to ec8ce86
Rebased to merge #124; no changes on my side.
@jpinsonneau thanks for sharing the link: I advise Also: sum_over_time is much more important than sum, so this is something to add
But
Yes it is. It's used to do the graphs. The numbers on edges are calculated from the Loki matrix; so it's a I'm also considering doing a "weather channel"-like animation with this data, to see the changes over time
@jpinsonneau @jotak If the sum is sum_over_time then this can be called in the UI
@eranra I don't follow you, how
(you would have to divide the
@jotak you are totally correct. It needs to be normalized over the time range ... what I see https://github.com/netobserv/network-observability-console-plugin/blob/main/web/src/api/loki.ts#L86 is confusing me ... this is exactly what I had in mind on rate but this is different than what @jpinsonneau explained.
Sorry if I was not clear @eranra; obviously I divide by Have you checked the backend implementation? You will have a better understanding of which Loki functions are used. If we all agree then we can remove current
This PR implements topology function selection:
And metrics selection:
Currently, Sum shows total Bytes or Packets while Max/Avg show speed. This could be decorrelated in the future in favor of a side panel option.
Rate is based on Loki logs rate, not metrics.
Will need #120 to be merged first.