Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: send MsgAppResp to replica co-located with client #325

Closed
tbg opened this issue Feb 17, 2015 · 12 comments
Closed

storage: send MsgAppResp to replica co-located with client #325

tbg opened this issue Feb 17, 2015 · 12 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Milestone

Comments

@tbg
Copy link
Member

tbg commented Feb 17, 2015

@spencerkimball mentioned this in #230:

Because a major focus for all leasing strategies is to reduce latency in the wide area, we implemented as part of our baseline a latency optimization described by Castro [5]. This optimization reduces the commit latency perceived by clients that are co-located with a replica other than the Multi-Paxos stable leader, by having other replicas transmit their AcceptReplies to both the stable leader and to the replica near the client. Thus, in the common case, the client does not need to wait the additional time for a message to come back from the stable leader, which reduces commit time from four one-way delays to three.

The paper is Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems, 20(4):398–461, November 2002.

This is certainly something that would be useful.

@bdarnell
Copy link
Contributor

This seems risky when translated from Paxos to Raft. If a leader broadcasts MsgApp and then dies, but the MsgAppResps are sent to the other followers, then those followers may consider the entry committed even though it was not committed by the leader of that term. This may turn out to be safe (if a quorum have the new log entry then no node without it can be elected), but it's very subtle.

@tbg
Copy link
Member Author

tbg commented Feb 18, 2015

I agree that it's subtle. Yet, as you say, if an entry is acknowledged by a majority of the nodes (within the same term - otherwise see raft.pdf, Figure 8), then, no matter what the leader does, that entry is guaranteed to be committed as nobody can be elected leader without it. So it should be sufficient to do the following:

  • add an optional CC (Carbon Copy) field in the MsgApp message.
  • as a leader, upon receiving a request from follower, note that follower in CC for the corresponding MsgApp messages.
  • upon replying positively to a MsgApp with CC set, send the MsgAppResp to the CC replica as well.
  • at the colocated replica:
    • upon redirecting a command to the leader, get ready to receive and count MsgAppResps with identical term for that entry (in practice, just give up when there is more than one term; also you probably don't want to prepare to actually count, but just start counting when the first MsgAppResp shows up).
    • check the replica's outgoing messages: If it is sending a MsgAppResp for that entry, and the term matches, increase the counter by two (to account for the fact that the co-located replica and the leader clearly have this entry).
    • once the counter equals or exceeds the quorum size, commit the entry locally.
    • once the entry is committed locally or overwritten, stop counting.

Does that sound sound?

@spencerkimball
Copy link
Member

It should be "once the counter is equal to the quorum size, commit the entry locally". Unless I'm missing something.

@tbg
Copy link
Member Author

tbg commented Feb 18, 2015

Almost. Changed to "equals or exceeds". Since we're incrementing by two in one of the cases, we may not exactly hit the quorum.

@bdarnell
Copy link
Contributor

This seems sound, although figuring out which entries are covered by another node's MsgAppResp is non-trivial.

@tbg
Copy link
Member Author

tbg commented Feb 20, 2015

I hadn't thought about that. The sending node is at the discretion to put whatever they want in that message, so in the worst case they could hash the data of the log entry (or whatever other unique identifier we can come up with) and send that along as an Entry when CC'ing.

Also maybe the CC field should sit on the Entry instead of the MsgApp because MsgApp may contain multiple entries with different CCs and that should trigger a CC message to each CC'ed host with Entries (containing unique identifiers) for each CCed entry.

Furthermore, the CC'ed MsgAppResp shouldn't look exactly like the original MsgAppResp as there's a chance the recipient CC node will have turned into a leader in the meantime and doesn't know this isn't a real MsgAppResp to a MsgApp which the new leader may have sent (and while that shouldn't hinder correctness, it needs to be taken into account to avoid unwanted side effects). We could special-case uint64(0) to be used as a dummy logIndex, or Reject should be true, with the index of the original MsgApp as RejectHint. But it is simpler, safer and more descriptive to invent a new type MsgCC for exactly that purpose, so my suggestion is

  • adding a message type MsgCC (or MsgAppRespCC, whichever seems clearer)

  • moving the CC field to the Entry type

  • after sending a MsgAppResp, go through the entries in it and compile a MsgCC message for each node in the set of all Entry.CC fields in the entries

  • MsgCC.Entries should contain, for each original Entry that is being CC'ed to that member, an Entry that unique describes the original one and allows the receiving node to commit that entry locally when it has collected a quorum of identical descriptors (and also has a gapless history of previous indexes so that the log completeness property holds):

    • Term
    • Index

    I think those two should be enough (if a majority of servers accept a log entry with same index and same term, then that is committed and hence equal). Otherwise, can always add a hash of the original entry's Data or something along those lines in the Data field.

@bdarnell
Copy link
Contributor

If the CC'd node has become a leader, then the term number will have increased, so the old MsgAppResp will be rejected on that basis. We don't need to make the CC messages look distinct for safety purposes, although doing so to pass along extra information might help.

This looks like a lot of subtle work for a small gain, though. It only applies in cases where the client is talking to a follower which is proxying to the leader, and per discussion in #326 we want to minimize that behavior (if we do it at all).

@tbg
Copy link
Member Author

tbg commented Feb 20, 2015

Yes, if we don't want that, let's not waste time with it. It does look pretty straightforward now though. CC @xiang90 in case someone wants it for upstream raft.

@xiang90
Copy link
Contributor

xiang90 commented Feb 20, 2015

@tschottdorf I will take a close look tonight.

@xiang90
Copy link
Contributor

xiang90 commented Feb 20, 2015

@tschottdorf
Quick reply: I think it is doable and we could try to make this happen in raft. I am not sure if the gain worth it though. Unless you really care about the committing latency when proxying.

@bdarnell
One observation from my pervious raft/paxos/epaxos performance tuning experience: proxy is actually good in some cases, since it help with handling client request and encoding it to a []byte. It reduce CPU load on one instance a lot. But for multi-raft, probably not. A node will be both follower and leader. The roles will balance this.

@tbg
Copy link
Member Author

tbg commented Feb 20, 2015

@xiang90 don't sweat it, I just wanted you to be aware of the discussion we had about it in case you wanted something like that in etcd/raft.

@tamird tamird added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2015
@petermattis petermattis modified the milestone: 1.0 Feb 14, 2016
@petermattis petermattis changed the title Raft: send MsgAppResp to replica co-located with client storage: send MsgAppResp to replica co-located with client Mar 31, 2016
@tbg
Copy link
Member Author

tbg commented Sep 26, 2016

Doesn't seem that this is anywhere on the roadmap, and if so, it would probably have to be in etcd/raft anyway. Plus, our use case for this isn't really there given that we "always" propose at the leader (with few exceptions).

@tbg tbg closed this as completed Sep 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

6 participants