Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(mdns): allow setting an initial delay before sending queries to discover peers. #3323

Conversation

stormshield-damiend
Copy link

@stormshield-damiend stormshield-damiend commented Jan 13, 2023

Description

feat(mdns): allow setting an initial delay before sending queries to discover peers (#3319)

Notes

In a similar way to identify behaviour, manage an initial delay before the first mdns query is sent on an interface.

In a docker environment, the first mdns query can be lost if the interface is not ready yet. As the interval query is by default rather large (5 minutes), using an initial delay before the first mdns query is sent would improve the speed of peers discovery.

Links to any relevant issues

Fixes #3319

Open Questions

I have kept the original behaviour, the initial delay is set by default to 0, perhaps this could be set to another default value like for example 500ms as done for identify.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd appreciate if we don't add another configuration option for this. I consider #3319 to be a bug that we should fix.

This PR only adds a configuration that in its default value does not fix the bug.

Have you measured, what such an initial delay would have to be set to to be effective? I'd be happy to merge a PR that adds a fixed 500ms delay like identify as long as we don't expose it as a config option.

@stormshield-damiend
Copy link
Author

I'd appreciate if we don't add another configuration option for this. I consider #3319 to be a bug that we should fix.

This PR only adds a configuration that in its default value does not fix the bug.

Have you measured, what such an initial delay would have to be set to to be effective? I'd be happy to merge a PR that adds a fixed 500ms delay like identify as long as we don't expose it as a config option.

On our test platform, 500ms does the job (this is the value we use) but this highly depend on the way your dockers are connected and init (in our case we manage the network connections between our dockers by hands, thus it take some time to become ready).

Identify has 500ms as default value and a method on Config to configure the initial_delay.

I can modify the PR to hardcode the 500ms and remove the configuration part, or i can also use 500ms as default value (to fix the issue for some case) and have the configuration token available to fix other cases where 500ms is not enough.

What do you prefer ?

@thomaseizinger
Copy link
Contributor

It would be good to get @mxinden's opinion. My personal preference would be to not expose another configuration option. I also consider the initial delay in identify a hack if it was added for the same reason.

In an ideal world, I think we should fix the actual bug which IMO is because we unconditionally wait for the delay and don't care about no / failure responses.

@stormshield-damiend
Copy link
Author

It would be good to get @mxinden's opinion. My personal preference would be to not expose another configuration option. I also consider the initial delay in identify a hack if it was added for the same reason.

In an ideal world, I think we should fix the actual bug which IMO is because we unconditionally wait for the delay and don't care about no / failure responses.

Sure, lets wait for mxinden's opinion

@stormshield-damiend
Copy link
Author

It would be good to get @mxinden's opinion. My personal preference would be to not expose another configuration option. I also consider the initial delay in identify a hack if it was added for the same reason.

In an ideal world, I think we should fix the actual bug which IMO is because we unconditionally wait for the delay and don't care about no / failure responses.

did you get any update from @mxinden ?

@thomaseizinger
Copy link
Contributor

It would be good to get @mxinden's opinion. My personal preference would be to not expose another configuration option. I also consider the initial delay in identify a hack if it was added for the same reason.
In an ideal world, I think we should fix the actual bug which IMO is because we unconditionally wait for the delay and don't care about no / failure responses.

did you get any update from @mxinden ?

I vaguely remember that we briefly discussed briefly and I think we are on the same page. Our components should do the right thing in their default configuration in ideally every situation and not just when they are configured in a certain way.

Not increasing the public API but making the implementation more resilient to a corner case like this is a low-friction change that I am happy to merge without further input as it doesn't have any downsides.

Let me know if you want to work on this!

@stormshield-damiend
Copy link
Author

It would be good to get @mxinden's opinion. My personal preference would be to not expose another configuration option. I also consider the initial delay in identify a hack if it was added for the same reason.
In an ideal world, I think we should fix the actual bug which IMO is because we unconditionally wait for the delay and don't care about no / failure responses.

did you get any update from @mxinden ?

I vaguely remember that we briefly discussed briefly and I think we are on the same page. Our components should do the right thing in their default configuration in ideally every situation and not just when they are configured in a certain way.

Not increasing the public API but making the implementation more resilient to a corner case like this is a low-friction change that I am happy to merge without further input as it doesn't have any downsides.

Let me know if you want to work on this!

To summarize, are you fine with

  1. set the default initial delay to 500ms
  2. keep the api to change it if needed (as done in identify)

If so i can re-work the PR and submit an updated version.
If you prefer only the 1. (500ms without the public API) i am also fine with this. Just tell me.

@thomaseizinger
Copy link
Contributor

It would be good to get @mxinden's opinion. My personal preference would be to not expose another configuration option. I also consider the initial delay in identify a hack if it was added for the same reason.
In an ideal world, I think we should fix the actual bug which IMO is because we unconditionally wait for the delay and don't care about no / failure responses.

did you get any update from @mxinden ?

I vaguely remember that we briefly discussed briefly and I think we are on the same page. Our components should do the right thing in their default configuration in ideally every situation and not just when they are configured in a certain way.
Not increasing the public API but making the implementation more resilient to a corner case like this is a low-friction change that I am happy to merge without further input as it doesn't have any downsides.
Let me know if you want to work on this!

To summarize, are you fine with

  1. set the default initial delay to 500ms
  2. keep the api to change it if needed (as done in identify)

Uhm no, that is not quite what I meant. I don't think an initial delay is the appropriate solution. It doesn't address the actual problem: not having received a response for a request that we sent out.

mDNS is multicast though and based on UDP, so a response is not necessarily guaranteed. Additionally, we could also just be in a network where there are no other mDNS devices.

Thus I think an appropriate solution would be:

Instead of sending out requests with a fixed interval build a state machine that:

  • Sends out a request, start a short timer (10s or something)
  • If no response within 10s, send another request
  • Repeat the above Y times or until receive a response
  • In either case, move to a longer timeout once we received a response

Essentially, what this does is search for devices at an increased rate until we either find some or we consider there to be no devices.

This should be a lot more resilient to any form of problems in the network.

Let me know what you think of this solution.

@mxinden
Copy link
Member

mxinden commented Feb 24, 2023

I also consider the initial delay in identify a hack if it was added for the same reason.

Agreed.

Essentially, what this does is search for devices at an increased rate until we either find some or we consider there to be no devices.

That sounds reasonable to me.

Maybe also worth checking out whether go-libp2p has a solution for this issue.

https://github.com/libp2p/go-libp2p/blob/master/p2p/discovery/mdns/mdns.go

@stormshield-damiend
Copy link
Author

stormshield-damiend commented Feb 24, 2023

I also consider the initial delay in identify a hack if it was added for the same reason.

Agreed.

Essentially, what this does is search for devices at an increased rate until we either find some or we consider there to be no devices.

That sounds reasonable to me.

Maybe also worth checking out whether go-libp2p has a solution for this issue.

https://github.com/libp2p/go-libp2p/blob/master/p2p/discovery/mdns/mdns.go

Go implementation uses zeroconf library which does this for the mdns client part :

  • max query interval is hard coded to 60s
  • initial query interval is hard coded to 4s
  • automata send queries until a response is received. it increases the periodic delay in case of no response or error using a random multiplier between 1.5 and 2.5, the delay being capped at max interval (60s)

Do you want we use the same values for the rust state automata ?

@stormshield-damiend
Copy link
Author

It would be good to get @mxinden's opinion. My personal preference would be to not expose another configuration option. I also consider the initial delay in identify a hack if it was added for the same reason.
In an ideal world, I think we should fix the actual bug which IMO is because we unconditionally wait for the delay and don't care about no / failure responses.

did you get any update from @mxinden ?

I vaguely remember that we briefly discussed briefly and I think we are on the same page. Our components should do the right thing in their default configuration in ideally every situation and not just when they are configured in a certain way.
Not increasing the public API but making the implementation more resilient to a corner case like this is a low-friction change that I am happy to merge without further input as it doesn't have any downsides.
Let me know if you want to work on this!

To summarize, are you fine with

  1. set the default initial delay to 500ms
  2. keep the api to change it if needed (as done in identify)

Uhm no, that is not quite what I meant. I don't think an initial delay is the appropriate solution. It doesn't address the actual problem: not having received a response for a request that we sent out.

mDNS is multicast though and based on UDP, so a response is not necessarily guaranteed. Additionally, we could also just be in a network where there are no other mDNS devices.

Thus I think an appropriate solution would be:

Instead of sending out requests with a fixed interval build a state machine that:

* Sends out a request, start a short timer (10s or something)

* If no response within 10s, send another request

* Repeat the above Y times or until receive a response

* In either case, move to a longer timeout once we received a response

Essentially, what this does is search for devices at an increased rate until we either find some or we consider there to be no devices.

This should be a lot more resilient to any form of problems in the network.

Let me know what you think of this solution.

Ok we (stormshield) will work on this new solution, you can drop this PR we will submit a new one with the updated code. It may take some time as we have other work to do internally.

@mergify
Copy link
Contributor

mergify bot commented May 1, 2023

This pull request has merge conflicts. Could you please resolve them @stormshield-damiend? 🙏

@thomaseizinger
Copy link
Contributor

Obsoleted by #3975.

mergify bot pushed a commit that referenced this pull request May 25, 2023
Peer discovery with mDNS can be very slow, particularly if the first mDNS query is lost. This patch resolves it by adjusting the timer with an adaptive initial interval. We start with a very short timer (500 ms). If a peer is discovered before the end of the timer, then the timer is reset to the normal query interval value (300s), otherwise the timer's value is multiplied by 2 until it reaches the normal query interval value.

Related: #3323.
Resolves: #3319.

Pull-Request: #3975.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Manage an initial delay before the first mdns query is sent on an interface
3 participants