Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-240 Topology metric selection #111

Merged
merged 2 commits into from
Apr 20, 2022

Conversation

jpinsonneau
Copy link
Contributor

@jpinsonneau jpinsonneau commented Apr 7, 2022

This PR implements topology function selection:

  • Sum
  • Max
  • Average
  • Rate
    image

And metrics selection:

  • Bytes
  • Packets
    image

Currently, Sum is showing total Bytes or Packets while Max / Avg show speed. This could be decorrelated in the future in favor of a side panel option.
Rate is based on loki logs rate, not metrics.

Will need #120 to be merged first

@openshift-ci
Copy link

openshift-ci bot commented Apr 7, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link

openshift-ci bot commented Apr 7, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from jpinsonneau after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jpinsonneau
Copy link
Contributor Author

/retest

Copy link
Collaborator

@OlivierCazade OlivierCazade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug :

image

To reproduce :

  • Select topology view
  • Collapse a group of pod by namespace
  • Change the metric type (ex: from avg to sum)
  • Expand the collapsed group

return &TopologyQueryBuilder{
FlowQueryBuilder: NewFlowQueryBuilder(cfg, start, end, limit, reporter),
topology: &Topology{
timeRange: timeRange,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are not using the timeRange properly here which lead to very strange value.

Here is an exemple:
If I select AVG/packets/ last hour, and I look at a communication between two specific node I get 105.26 packets/min, but now if I select SUM/packets/ last hour I get 2000 while I would expect something around 6000

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. On my side I get the correct numbers like 7908/min for 482000 in total.

The query is built the following way:

<url path>?query=
	topk(
	 	<k>,
		sum by(<aggregations>) (
			<function>(
				{<label filters>}|<line filters>|json|<json filters>
					|unwrap Bytes|__error__=""[<time>s]
			)
		)
	)
	&<query params>&step=60s

So you will get a set of values (one per minute) of average or sum in your example. Maybe network sampling makes avg function wrong since it aggregate flows in a single time 🤔

Also we can play with the step:

//TODO: check if step should be configurable. 60s is forced to help calculations on front end side
sb.WriteString("&step=60s")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As seen together, this appear when netflows are not communicating during a 60s period of time (step). In that case Loki doesn't return any value for this time range and the avg / rate total calculation is false.

I have pushed a fix. Thanks @OlivierCazade 🥇

@jpinsonneau
Copy link
Contributor Author

Bug :

image

I can't reproduce that one on my side. I used to have that when multiple elements had the same ids (so it can create an infinite parent / child relation) or with empty groups. Both are supposed to be managed here.

Do you get it all the time ?

@OlivierCazade
Copy link
Collaborator

Do you get it all the time ?

I saw that you pushed some commits while I was reviewing, so I tried again and I confirm that I can reproduce it each time.
Last commit 5d93483 : topology function + metric selection

@jpinsonneau
Copy link
Contributor Author

I saw that you pushed some commits while I was reviewing, so I tried again and I confirm that I can reproduce it each time.

Strange, this has been pushed yesterday 🤔
jpinsonneau force-pushed the topology_metrics branch from 08ce9d4 to 5d93483 23 hours ago

@jpinsonneau
Copy link
Contributor Author

@OlivierCazade I have created a bug task for the crash. Updating the lib doesn't fix this and it may impact other pending PRs. I prefer to fix this separately

https://issues.redhat.com/browse/NETOBSERV-293

@jpinsonneau jpinsonneau force-pushed the topology_metrics branch 2 times, most recently from 5777538 to c90a89c Compare April 13, 2022 15:42
@jpinsonneau
Copy link
Contributor Author

I confirm that the bug is on PF lib side and should not block this PR
patternfly/patternfly-react#7261

@mariomac
Copy link

The "Rate" metric is not clear to me. Does it mean the percentage of traffic from the total? If so, maybe we should find another name: fraction, percentage (from total), ...

@mariomac mariomac self-assigned this Apr 20, 2022
@jpinsonneau
Copy link
Contributor Author

The "Rate" metric is not clear to me. Does it mean the percentage of traffic from the total? If so, maybe we should find another name: fraction, percentage (from total), ...

Yes I agree. Ok for Percentage naming
For now rate is the % of rows found on loki perspective so if you use sampling or at least your packets count are inconsistent between each entry, the rate number will not be relevant at all.

Maybe we should do a sum_over_time on a selected field and then do the math on our side ?
We can disable it for now if you prefer

@mariomac
Copy link

It's ok. It was just a naming confusion from my side

@jotak
Copy link
Member

jotak commented Apr 20, 2022

The "Rate" metric is not clear to me. Does it mean the percentage of traffic from the total? If so, maybe we should find another name: fraction, percentage (from total), ...

For what it's worth, in Kiali it's called "Traffic Distribution": https://kiali.io/docs/tutorials/travels/04-observe/#graph-walkthrough (ie. from a given graph node, the percentage of outbound traffic to a destination from total outbound traffic).

But if I understand correctly, it's not the same thing here.

@jpinsonneau
Copy link
Contributor Author

rebased to merge #124

no changes on my side

@openshift-ci openshift-ci bot added the lgtm label Apr 20, 2022
@jpinsonneau jpinsonneau merged commit 1b832c3 into netobserv:main Apr 20, 2022
@eranra
Copy link
Contributor

eranra commented May 11, 2022

@jpinsonneau thanks for sharing the link::

I advise We can disable it for now if you prefer -- the number of rows is not very useful for customers

Also: sum_over_time is very important more than sum so this is something to add

@jotak
Copy link
Member

jotak commented May 12, 2022

@jpinsonneau
Copy link
Contributor Author

But Sum is actually already sum_over_time, no? cf https://github.com/netobserv/network-observability-console-plugin/blob/main/pkg/loki/topology_query.go#L62

Yes it is. It's used to do the graphs. The numbers on edges are calculated from loki matrix; so it's a sum of sum_over_time.
https://github.com/netobserv/network-observability-console-plugin/blob/main/web/src/api/loki.ts#L70

I'm also considering to do a "weather channel" like animation with these data, to see the changes during time

@eranra
Copy link
Contributor

eranra commented May 12, 2022

@jpinsonneau @jotak If the sum is sum_over_time then this can be called in the UI rate which is what you see on the screen ( and remove the current rate ) ---

@jotak
Copy link
Member

jotak commented May 12, 2022

@eranra I don't follow you, how sum_over_time can be called rate?
sum_over_time returns the absolute amount of Bytes transferred in the selected time frame, like 50MB.
Whereas rate IMO should be what is currently called in the UI max or avg, it's the average or max bitrate (Bps) in that time frame
And I agree that what is called currently Rate should be removed.

@jotak
Copy link
Member

jotak commented May 12, 2022

(you would have to divide the sum_over_time with the observed time range to retrieve a rate)

@eranra
Copy link
Contributor

eranra commented May 13, 2022

@jotak you are totally correct. It needs to be normalized over the time range ... what I see https://github.com/netobserv/network-observability-console-plugin/blob/main/web/src/api/loki.ts#L86 is confusing me ... this is exactly what I had in mind on rate but this is different than what @jpinsonneau explained.

@jpinsonneau
Copy link
Contributor Author

Sorry if I was not clear @eranra; obviously I divide by rangeInMinutes on client side to show edge tags for avg and rate as you saw in the code.

Have you checked the backend implementation ? You will have a better understanding of which loki functions are used.

If we all agree then we can remove current rate and rename 'Avg' to 'Rate' 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants