Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apply term should be assigned when applying snapshot #10225

Closed
BusyJay opened this issue May 20, 2021 · 1 comment · Fixed by #10476
Closed

apply term should be assigned when applying snapshot #10225

BusyJay opened this issue May 20, 2021 · 1 comment · Fixed by #10476
Assignees
Labels
affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. severity/critical sig/raft Component: Raft, RaftStore, etc. type/bug The issue is confirmed as a bug.

Comments

@BusyJay
Copy link
Member

BusyJay commented May 20, 2021

Bug Report

What version of TiKV are you using?

v5.0.1, though v4.0.x should also be affected.

What operating system and CPU are you using?

Doesn't matter.

Steps to reproduce

Suppose a region contains 3 replicas a, b and c, a is leader, c is isolated and has not been initialized. a and b decides to add a new replica d, and promote it as voter before it's initialized. After a has applied to last log, it sends a snapshot to d and initialize it with latest data. And then a transfer leader to d. Right after d wins the election, c is recovered and requests snapshot from d before d commits any entry.

What did you expect?

c will be initialized properly.

What did happened?

c will panic with following stack.

[2021/05/20 19:02:04.640 +00:00] [FATAL] [lib.rs:465] ["to_commit 41 is out of range [last_index 0], raft_id: 149208, region_id: 149205"] [backtrace="stack backtrace:
   0: tikv_util::set_panic_hook::{{closure}}
             at /tikv/components/tikv_util/src/lib.rs:464
   1: std::panicking::rust_panic_with_hook
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/panicking.rs:595
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/panicking.rs:497
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/sys_common/backtrace.rs:141
   4: rust_begin_unwind
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/panicking.rs:493
   5: std::panicking::begin_panic_fmt
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/panicking.rs:435
   6: raft::raft_log::RaftLog<T>::commit_to
             at /rust/git/checkouts/raft-rs-42b8049ef2e3af07/91a60ce/src/raft_log.rs:252
   7: raft::raft::Raft<T>::restore
             at /rust/git/checkouts/raft-rs-42b8049ef2e3af07/91a60ce/src/raft.rs:2389
      raft::raft::Raft<T>::handle_snapshot
             at /rust/git/checkouts/raft-rs-42b8049ef2e3af07/91a60ce/src/raft.rs:2307
   8: raft::raft::Raft<T>::step_follower
             at /rust/git/checkouts/raft-rs-42b8049ef2e3af07/91a60ce/src/raft.rs:2135
      raft::raft::Raft<T>::step
             at /rust/git/checkouts/raft-rs-42b8049ef2e3af07/91a60ce/src/raft.rs:1440
   9: raft::raw_node::RawNode<T>::step
             at /rust/git/checkouts/raft-rs-42b8049ef2e3af07/91a60ce/src/raw_node.rs:378
      raftstore::store::peer::Peer<EK,ER>::step
             at /tikv/components/raftstore/src/store/peer.rs:1139
      raftstore::store::fsm::peer::PeerFsmDelegate<EK,ER,T>::on_raft_message
             at /tikv/components/raftstore/src/store/fsm/peer.rs:1228
  10: raftstore::store::fsm::peer::PeerFsmDelegate<EK,ER,T>::handle_msgs
             at /tikv/components/raftstore/src/store/fsm/peer.rs:484
  11: <raftstore::store::fsm::store::RaftPoller<EK,ER,T> as batch_system::batch::PollHandler<raftstore::store::fsm::peer::PeerFsm<EK,ER>,raftstore::store::fsm::store::StoreFsm<EK>>>::handle_normal
             at /tikv/components/raftstore/src/store/fsm/store.rs:829
  12: batch_system::batch::Poller<N,C,Handler>::poll
             at /tikv/components/batch-system/src/batch.rs:295
  13: batch_system::batch::BatchSystem<N,C>::spawn::{{closure}}
             at /tikv/components/batch-system/src/batch.rs:399
      std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/std/src/sys_common/backtrace.rs:125
  14: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/std/src/thread/mod.rs:474
      <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/std/src/panic.rs:322
      std::panicking::try::do_call
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/std/src/panicking.rs:379
      std::panicking::try
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/std/src/panicking.rs:343
      std::panic::catch_unwind
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/std/src/panic.rs:396
      std::thread::Builder::spawn_unchecked::{{closure}}
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/std/src/thread/mod.rs:473
      core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/core/src/ops/function.rs:227
  15: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/alloc/src/boxed.rs:1484
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/alloc/src/boxed.rs:1484
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/sys/unix/thread.rs:71
  16: start_thread
  17: clone
"] [location=/rust/git/checkouts/raft-rs-42b8049ef2e3af07/91a60ce/src/raft_log.rs:252] [thread_name=raftstore-1-1]

After v4.0.0, snapshot are generated using apply index and apply term from apply worker. Apply worker initializes these fields when a peer fsm has applied the snapshot. But applying snapshot only updates apply index and leave apply term untouched, so apply worker will set apply term to a stale value until it applies next entry. If a snapshot is generated at that time, the snapshot will be set to a wrong log term.

Because raft-rs return 0 for any term query on entries beyond its logs, so term check can succeed. Receiver will fast-forward the snapshot and commit the index. Hence panic as receiver has no such log at all. Generally, PD only promotes initialized learner to voter, so the learner has to apply at least one log before it starts to campaign, apply term will be assigned to correct value before sending snapshots.

But if a long time isolated follower becomes leader right after accepting snapshot, it can still generate snapshots with wrong metadata. In this case, the stale term may not be zero and receiver can apply the data with wrong metadata. Though receiver should be able to find conflict when accepting next empty entry and request a new snapshot again. There seems to be no harm except wrong metadata itself.

@BusyJay BusyJay added sig/raft Component: Raft, RaftStore, etc. type/bug The issue is confirmed as a bug. labels May 20, 2021
@cosven
Copy link
Member

cosven commented May 28, 2021

/severity critical

BusyJay added a commit to BusyJay/tikv that referenced this issue Jun 28, 2021
After applying snapshot applied_term should also be updated, otherwise
it can produce snapshot with wrong term and cause panic in follower.

Close tikv#10225

Signed-off-by: Jay Lee <[email protected]>
ti-chi-bot added a commit that referenced this issue Jun 29, 2021
* add test case

Signed-off-by: Jay Lee <[email protected]>

* raftstore: update applied_term after snapshot

After applying snapshot applied_term should also be updated, otherwise
it can produce snapshot with wrong term and cause panic in follower.

Close #10225

Signed-off-by: Jay Lee <[email protected]>

Co-authored-by: Ti Chi Robot <[email protected]>
tiancaiamao pushed a commit to tiancaiamao/tikv that referenced this issue Aug 11, 2021
* add test case

Signed-off-by: Jay Lee <[email protected]>

* raftstore: update applied_term after snapshot

After applying snapshot applied_term should also be updated, otherwise
it can produce snapshot with wrong term and cause panic in follower.

Close tikv#10225

Signed-off-by: Jay Lee <[email protected]>

Co-authored-by: Ti Chi Robot <[email protected]>
Signed-off-by: tiancaiamao <[email protected]>
@BusyJay BusyJay added affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. labels Dec 1, 2021
BusyJay added a commit to BusyJay/tikv that referenced this issue Dec 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-4.0 This bug affects 4.0.x versions. affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. severity/critical sig/raft Component: Raft, RaftStore, etc. type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants