
raft: remove apply pacing #133175

Open

pav-kv wants to merge 4 commits into master from raft-rm-apply-limit
Conversation

@pav-kv (Collaborator) commented Oct 22, 2024

This commit removes the committed-entries application pacing from the raftLog type. This is done for a few reasons:

  • CRDB does not use this capability. Entries are applied immediately in the Ready handler, so there is never a time when two MsgStorageApply messages are in flight.
  • If we re-introduce this pacing, it will be done outside the raft package. Instead of a per-range in-flight limit, we will need a node/store-level limit to help avoid OOMs.
  • We are moving away from doing IO from within the RawNode. Fetching committed entries from log storage will be done via the LogSnapshot at some point. It will then be natural for the caller to implement the size policies at its level.
  • This change separates the concerns of raftLog being a data structure from flow control / pacing policies. Before this change, the two were coupled in the raftLog type.
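The pattern the first bullet relies on (apply committed entries synchronously inside the Ready handler, so a second MsgStorageApply is never outstanding) can be sketched with hypothetical stand-in types. `Entry`, `Ready`, and `applyImmediately` below are illustrative names only, not the actual raft or CRDB API:

```go
package main

import "fmt"

// Entry and Ready are toy stand-ins for the raft types of the same names.
type Entry struct {
	Index uint64
	Data  string
}

type Ready struct {
	CommittedEntries []Entry
}

// applyImmediately applies every committed entry synchronously before the
// Ready is acked, so there is never a second apply batch in flight.
// It returns the highest applied index.
func applyImmediately(rd Ready, apply func(Entry)) uint64 {
	var lastApplied uint64
	for _, e := range rd.CommittedEntries {
		apply(e)
		lastApplied = e.Index
	}
	return lastApplied
}

func main() {
	rd := Ready{CommittedEntries: []Entry{{11, "a"}, {12, "b"}, {13, "c"}}}
	last := applyImmediately(rd, func(e Entry) { fmt.Println("apply", e.Index) })
	fmt.Println("applied through", last)
}
```

Because application completes before the Ready loop continues, per-Ready pacing inside raftLog has nothing to pace in this usage.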

Related to #105850

Epic: none
Release note: none
@cockroach-teamcity (Member): This change is Reviewable

@pav-kv pav-kv requested review from nvanbenschoten and tbg October 22, 2024 18:11
@pav-kv pav-kv force-pushed the raft-rm-apply-limit branch from 4cb98c4 to 7d26089 on October 22, 2024 23:25
@pav-kv pav-kv force-pushed the raft-rm-apply-limit branch from 7d26089 to 190baa0 on October 22, 2024 23:34
@pav-kv pav-kv marked this pull request as ready for review October 22, 2024 23:44
@pav-kv pav-kv requested a review from a team as a code owner October 22, 2024 23:44
@tbg (Member) left a comment:

So this code that you deleted would've been useful only in cases in which entry application is delegated to an async thread, right? I'm surprised that you didn't need to remove/change any tests, meaning that there wasn't any coverage. Looks like most of the code you're removing was added in etcd-io/raft@2e0653d.
I agree that we're moving toward a world in which raft communicates what can be applied in principle (i.e. the commit index) rather than doling it out piecemeal, so I'm on board with stripping this functionality down to the bare minimum (don't emit entries to apply unless the previous batch is acked). Is there a reason you didn't strip it further to the point where entries have to be applied before acking the ready?

@@ -505,8 +505,6 @@ process-ready 1 2 3
HardState Term:1 Vote:1 Commit:13 Lead:1 LeadEpoch:1
Entries:
1/15 EntryNormal "prop_4"
CommittedEntries:
Member (inline comment on the diff above):

Why is this test changing?

@pav-kv (Collaborator, Author) replied Oct 23, 2024:

I still need to audit it and confirm, but I think: there can now be only one MsgStorageApply in flight, so some committed entries migrated between Readys. The test probably exercised more than one message in flight.

@pav-kv (Collaborator, Author) commented Oct 23, 2024:

@tbg

> So this code that you deleted would've been useful only in cases in which entry application is delegated to an async thread, right?

Yes. We don't do this today. If we did, we would risk OOM, because there is no limit on the total volume of entries in the apply thread's queue across all ranges. So if we ever want an asynchronous apply thread, it should come with some store/node-scoped flow control outside raft.
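A store/node-scoped limit of the kind suggested here could take the shape of a shared byte budget that every range must acquire from before queueing entries for an async apply thread, bounding total memory across all ranges. This is a minimal sketch under that assumption; `applyBudget` and its methods are hypothetical names, not existing CRDB code:

```go
package main

import (
	"fmt"
	"sync"
)

// applyBudget is a node-wide byte budget shared by all ranges. A range
// acquires bytes before handing entries to the apply thread and releases
// them once the entries have been applied.
type applyBudget struct {
	mu    sync.Mutex
	avail int64
}

// tryAcquire reserves bytes from the budget, or reports false if the
// reservation would overshoot it (the caller must retry later).
func (b *applyBudget) tryAcquire(bytes int64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if bytes > b.avail {
		return false
	}
	b.avail -= bytes
	return true
}

// release returns previously acquired bytes to the budget.
func (b *applyBudget) release(bytes int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.avail += bytes
}

func main() {
	b := &applyBudget{avail: 1 << 20} // 1 MiB shared across all ranges
	fmt.Println(b.tryAcquire(512 << 10)) // true
	fmt.Println(b.tryAcquire(768 << 10)) // false: would exceed the budget
	b.release(512 << 10)
	fmt.Println(b.tryAcquire(768 << 10)) // true after the release
}
```

A production version would likely block with backpressure (e.g. a weighted semaphore) rather than spin on tryAcquire, but the memory bound is the same.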

> I'm surprised that you didn't need to remove/change any tests, meaning that there wasn't any coverage.

There was coverage. I had to fix ~5 unit tests which were exercising this flow-control capability and, in some cases, assumed that more than one MsgStorageApply can be in flight, like the datadriven test.

> Is there a reason you didn't strip it further to the point where entries have to be applied before acking the ready?

Just making a minimal change scoped to the raftLog type for now. Restricting when entries are applied seems unnecessary, though: we don't want to go back to the pre-async-writes API where write completion is assumed implicitly. So the API is still asynchronous and explicit, but the flow control has been made primitive (one limited-size MsgStorageApply in flight at a time), before we maybe enhance it again at a different level.
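The primitive flow control described above (at most one size-limited MsgStorageApply outstanding at a time) can be illustrated with a toy gate; `applyGate`, `next`, and `ack` are invented names for illustration, not the raft package API:

```go
package main

import "fmt"

// applyGate models "one limited-size apply batch in flight at a time".
type applyGate struct {
	inFlight bool  // true while a handed-out batch awaits its ack
	maxBytes int64 // soft size cap on a single batch
}

// next returns a prefix of pending trimmed to roughly maxBytes, or nil
// if the previously handed-out batch has not been acked yet.
func (g *applyGate) next(pending [][]byte) [][]byte {
	if g.inFlight || len(pending) == 0 {
		return nil
	}
	var out [][]byte
	var size int64
	for _, e := range pending {
		out = append(out, e)
		size += int64(len(e))
		if size >= g.maxBytes {
			break // soft limit: always include at least one entry
		}
	}
	g.inFlight = true
	return out
}

// ack marks the in-flight batch as applied, allowing the next one out.
func (g *applyGate) ack() { g.inFlight = false }

func main() {
	g := &applyGate{maxBytes: 8}
	pending := [][]byte{[]byte("aaaa"), []byte("bbbb"), []byte("cccc")}
	fmt.Println(len(g.next(pending)))   // 2: the size cap is hit after 8 bytes
	fmt.Println(g.next(pending) == nil) // true: previous batch not acked yet
	g.ack()
	fmt.Println(len(g.next(pending)))   // 2: same prefix in this toy example
}
```

The completion signal stays explicit (the ack), so the asynchrony of the API is preserved; only the windowing is reduced to one batch.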

@tbg (Member) left a comment:

Still LGTM, thanks for the explanation. I missed the tests, but see them now.

storage: storage,
unstable: newUnstable(last, logger),

// Initialize our committed and applied pointers to the time of the last
👍🏽
