Fix default memberlist configuration value for RetransmitMult. #4269

Merged
merged 5 commits into from
Jun 11, 2021

@stevesg stevesg commented Jun 9, 2021

What this PR does:
If configuration is not explicitly given for RetransmitMult (via
-memberlist.retransmit_factor), then it is intended to be picked up
from DefaultLANConfig. However, though the correct value was being
used to configure memberlist itself, zero would be passed into the
TransmitLimitedQueue used for broadcasting ring updates. This
essentially means that ring updates are only ever gossiped once.
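A minimal sketch of why a zero multiplier matters. It assumes the retransmit limit scales as the multiplier times ceil(log10(n+1)), which is my understanding of how memberlist sizes retransmissions to the cluster; `retransmitLimit` and `numNodes` below are local stand-ins for illustration, not the Cortex or memberlist code.

```go
package main

import (
	"fmt"
	"math"
)

// retransmitLimit is a local stand-in (an assumption, not the library's code)
// modeled on scaling retransmissions with cluster size:
// mult * ceil(log10(numNodes+1)).
func retransmitLimit(retransmitMult, numNodes int) int {
	nodeScale := math.Ceil(math.Log10(float64(numNodes + 1)))
	return retransmitMult * int(nodeScale)
}

func main() {
	const numNodes = 50 // hypothetical cluster size

	// Unset flag value (0) passed straight into the broadcast queue.
	fmt.Println("mult=0:", retransmitLimit(0, numNodes))

	// Value expected from DefaultLANConfig (4).
	fmt.Println("mult=4:", retransmitLimit(4, numNodes))
}
```

With a multiplier of zero the limit is zero regardless of cluster size, which matches the observation that each ring update is gossiped only once.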

Which issue(s) this PR fixes:
Fixes #4010 (unconfirmed)

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci pracucci left a comment

Good job! This PR definitely needs a CHANGELOG entry 😉

stevesg commented Jun 10, 2021

Looking into unit test failure...

@pstibrany pstibrany left a comment

Nice find!

@pracucci pracucci left a comment

Good job! I left a couple of nits in the integration test. The failing unit tests look like flaky tests unrelated to this PR's changes.

@bboreham

I was confused that we document the default -memberlist.retransmit_factor as zero, when you say your expectation was that it will be 4. It seems that KV.buildMemberlistConfig() does the same thing with a range of arguments.

Could we have KVConfig.RegisterFlags() call memberlist.DefaultLANConfig() and obtain the defaults that way?

(this could be a separate issue/PR)

stevesg commented Jun 10, 2021

That's a good idea - makes the defaults more obvious and may have avoided this bug. I'll try it as a separate PR.

Edit: Thinking about this, it might be a breaking change: setting a configuration value explicitly to "0" would behave differently. Will think on it.
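To make the breaking-change concern concrete, here is a rough sketch using only the standard `flag` package. The constant, both function names, and the help text are hypothetical; the value 4 and the flag name are taken from the discussion above. It contrasts the current pattern (flag defaults to 0, and 0 means "use the library default") with the proposed one (the flag default is the library default).

```go
package main

import (
	"flag"
	"fmt"
)

// libraryDefaultRetransmitMult stands in for the value that
// memberlist.DefaultLANConfig() would supply (4, per the discussion).
const libraryDefaultRetransmitMult = 4

// effectiveFallback models the current approach: 0 is treated as
// "not set", so the library default is used instead.
func effectiveFallback(flagValue int) int {
	if flagValue != 0 {
		return flagValue
	}
	return libraryDefaultRetransmitMult
}

// parseWithLibraryDefault models the proposal: the flag's own default is the
// library default, and whatever the flag holds is used as-is.
func parseWithLibraryDefault(args []string) (int, error) {
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	v := fs.Int("memberlist.retransmit_factor", libraryDefaultRetransmitMult, "hypothetical help text")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *v, nil
}

func main() {
	explicitZero := []string{"-memberlist.retransmit_factor=0"}

	proposed, err := parseWithLibraryDefault(explicitZero)
	if err != nil {
		panic(err)
	}

	// Same user input, different effective value under each approach.
	fmt.Println("fallback approach, explicit 0:", effectiveFallback(0))
	fmt.Println("proposed approach, explicit 0:", proposed)
}
```

Under the fallback approach an explicit 0 is indistinguishable from an unset flag and becomes 4; under the proposal it stays 0, which is the behavior change being weighed here.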

@pstibrany

> I was confused that we document the default -memberlist.retransmit_factor as zero, when you say your expectation was that it will be 4. It seems that KV.buildMemberlistConfig() does the same thing with a range of arguments.
>
> Could we have KVConfig.RegisterFlags() call memberlist.DefaultLANConfig() and obtain the defaults that way?

This can introduce unintended changes to default values over time. (Our current approach can as well, but it’s more explicit about what’s going on.) That said, I think it would be helpful.

@pull-request-size pull-request-size bot added size/L and removed size/M labels Jun 10, 2021
@stevesg stevesg marked this pull request as ready for review June 10, 2021 18:02
stevesg added 4 commits June 10, 2021 21:23
If configuration is not explicitly given for RetransmitMult (via
`-memberlist.retransmit_factor`), then it is intended to be picked up
from `DefaultLANConfig`. However, though the correct value was being
used to configure `memberlist` itself, zero would be passed into the
`TransmitLimitedQueue` used for broadcasting ring updates. This
essentially means that ring updates are only ever gossiped once.

Signed-off-by: Steve Simpson <[email protected]>
Signed-off-by: Steve Simpson <[email protected]>
Signed-off-by: Steve Simpson <[email protected]>
stevesg commented Jun 10, 2021

Hold off merging this - the unit test failure appears to be a race condition in the test itself: the test signals the memberlist service to shut down, then tries to grab the state from the clients. Sometimes the memberlist service shuts down before the test is able to grab the state.

@pstibrany

> Hold off merging this

You can mark the PR as a draft; that prevents merging.

The test was shutting down the KV store then attempting to read from it.
Sometimes this would work if the KV took some time to shut down, which it
often will, but if it shuts down quickly, then the read will fail.

Signed-off-by: Steve Simpson <[email protected]>
stevesg commented Jun 11, 2021

Fixed now, didn't seem worth pulling out into a separate PR.

@pstibrany pstibrany merged commit 9aa910f into cortexproject:master Jun 11, 2021
Successfully merging this pull request may close these issues.

memberlist keeps resurrecting deleted store-gateways