swarm: track dial cancellation reason #2532

Merged: 4 commits, Aug 30, 2023


@sukunrt (Member) commented Aug 27, 2023

closes: #2321

@sukunrt force-pushed the sukun/swarm-metrics-close branch from 3665bd0 to 055cec0 (August 27, 2023 12:13)
@sukunrt requested a review from marten-seemann (August 27, 2023 12:13)
p2p/net/swarm/swarm_metrics.go (outdated)
if errors.Is(cause, errParentContextCanceled) {
// parent was canceled
e = "canceled"
} else if errors.Is(cause, context.Canceled) {
Contributor

This looks brittle. Should we have a separate cause for this?

Member Author

Makes sense. That would be cleaner.

Member Author

I removed errParentContextCanceled and added errConcurrentDialSuccessful.

Now, if the cause is context.Canceled, it necessarily means the application canceled the context, and if the cause is errConcurrentDialSuccessful, it means a concurrent dial succeeded.
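
A minimal sketch of the distinction described above, not the merged code. It assumes Go 1.20+ for `context.WithCancelCause` and `context.Cause`. The helper names `cancelReason` and `demo` are hypothetical, the sentinel's message text is an assumption, and the label strings are borrowed from the outdated snippets quoted in this thread, so the final labels may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errConcurrentDialSuccessful mirrors the sentinel named in this thread.
// The message text is an assumption.
var errConcurrentDialSuccessful = errors.New("concurrent dial successful")

// cancelReason is a hypothetical helper that turns the cause of a canceled
// context into a metrics label.
func cancelReason(ctx context.Context) string {
	cause := context.Cause(ctx)
	switch {
	case cause == nil:
		return "not canceled"
	case errors.Is(cause, errConcurrentDialSuccessful):
		// The dial was abandoned because another dial succeeded.
		return "canceled: dial successful"
	case errors.Is(cause, context.Canceled):
		// No explicit cause was set, so the cancellation came from the caller.
		return "canceled"
	default:
		// Anything else, for example a deadline.
		return "canceled: other"
	}
}

// demo returns the labels for the two cases discussed above.
func demo() []string {
	// Case 1: the dial context is canceled with an explicit cause.
	ctx1, cancel1 := context.WithCancelCause(context.Background())
	cancel1(errConcurrentDialSuccessful)

	// Case 2: the application cancels the parent context. The child
	// inherits the parent's cause, which is context.Canceled.
	parent, cancelParent := context.WithCancel(context.Background())
	ctx2, cancel2 := context.WithCancelCause(parent)
	defer cancel2(nil)
	cancelParent()

	return []string{cancelReason(ctx1), cancelReason(ctx2)}
}

func main() {
	for _, label := range demo() {
		fmt.Println(label)
	}
}
```

Because the sentinel is checked before `context.Canceled`, the two cases cannot be confused even though both surface to the dialer as a canceled context.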

p2p/net/swarm/swarm_metrics.go (outdated)
e = "canceled: dial successful"
} else {
// something else
e = "canceled: other"
Contributor

Do we expect this to happen at all?

Member Author

I don't think so. It didn't happen on my kubo node, which I ran for about an hour.

p2p/net/swarm/dial_sync.go (outdated)
delete(ds.dials, p)
}
}()
return ad.dial(ctx)
conn, err := ad.dial(ctx) // updated err is used in deferred func
Contributor

Are you sure this works? Isn't := creating a new err here?

Why do we have this defer anyway, there's only a single return from this function.

Member Author

@sukunrt, Aug 28, 2023

err is declared at the beginning of the method, in the same scope, so := only declares conn and assigns to the existing err.

Anyway, I will remove the defer. That is confusing.
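
For readers unsure about the `:=` question raised above, here is a small standalone sketch of Go's redeclaration rule. The functions `dial`, `sameScope`, and `nestedScope` are hypothetical and unrelated to the swarm code. They only show that `:=` reuses a variable declared earlier in the same block and shadows it when used in a nested block.

```go
package main

import (
	"errors"
	"fmt"
)

// dial is a hypothetical stand-in that always fails.
func dial() (string, error) { return "", errors.New("dial failed") }

// sameScope declares err first and then uses := in the same block.
// Only conn is new, so err is assigned and the deferred closure sees it.
func sameScope() (observed error) {
	var err error
	defer func() { observed = err }()
	conn, err := dial()
	_ = conn
	return nil
}

// nestedScope uses := inside an inner block. That declares a fresh err
// which shadows the outer one, so the deferred closure still sees nil.
func nestedScope() (observed error) {
	var err error
	defer func() { observed = err }()
	{
		conn, err := dial()
		_, _ = conn, err
	}
	return nil
}

func main() {
	fmt.Println("same scope sees error:", sameScope() != nil)
	fmt.Println("nested scope sees error:", nestedScope() != nil)
}
```

The reviewer's worry applies to the second form. Dropping the defer, as agreed above, avoids the question entirely.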

Member Author

Removed it.

@sukunrt requested a review from marten-seemann (August 28, 2023 05:34)
@sukunrt force-pushed the sukun/swarm-metrics-close branch from bb7cc59 to 69b01a6 (August 28, 2023 05:47)
ad.refCnt--
if ad.refCnt == 0 {
if err == nil {
ad.cancelCause(errConcurrentDialSuccessful)
Contributor

Why? What would happen if we're just dialing one address, and that dial is successful?

Member Author

That dial would have succeeded so the context cancellation has no effect on that dial. That method would have returned.

There are two options:

  1. We call cancelCause(nil) and live with the brittle check that errors.Is(Cause(err), context.Canceled) signals a concurrent successful dial. Here we override actual cancellations of the parent context with errParentContextCanceled. This is the strategy in commit 055cec0.
  2. The latest commit: we explicitly signal errConcurrentDialSuccessful when err is nil. If the dial succeeds, this is not a problem, since we do not call FailedDial to track metrics in that case.

@marten-seemann merged commit 1153b1b into master on Aug 30, 2023
Development

Successfully merging this pull request may close these issues.

Use CancelWithCause in swarm.dialPeer for better metrics reporting
2 participants