Commit
fix: clean states in local barrier manager after actor dropped (#7082)
Trying to fix the continuous recovery found in longevity and chaos tests. I found two problems that might be the root cause of continuous recovery:

1. Fixed: unnecessary recovery triggered as described in #6989. When I tested locally under very high workload, there were many ongoing barrier collect responses (up to 80+) during recovery. After recovery finished, each response would trigger another recovery process, because the whole cluster had already been reset to the previous committed epoch.
2. Before this PR, when force stopping actors in a CN, the local barrier manager would clean all states and then abort all actors. The problem is that between cleaning states and aborting actors, the actors could still report epoch collected or error status to the local barrier manager, especially when the number of actors is high. This causes a chain reaction in recovery.

I tested it locally and the recovery became normal. Besides, this could also be the cause of #6639 and #6715.

Approved-By: fuyufjh
Approved-By: BugenZhao
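
The ordering issue in the second problem above can be illustrated with a small, self-contained Rust sketch. It is only a model under stated assumptions: `LocalBarrierState`, `force_stop_clean_first`, `force_stop_abort_first`, and `demo` are hypothetical names invented for illustration and are not the actual types or functions changed by this commit. Reports that actors send while a force stop is in progress are modeled as an `in_flight` slice applied between the two steps.

```rust
use std::collections::{HashMap, HashSet};

type ActorId = u32;
type Epoch = u64;

/// Hypothetical stand-in for the per-node barrier state (not the real struct).
#[derive(Default)]
struct LocalBarrierState {
    /// epoch -> actors that reported collecting the barrier of that epoch
    collected: HashMap<Epoch, HashSet<ActorId>>,
    /// actors currently running on this node
    running: HashSet<ActorId>,
}

impl LocalBarrierState {
    fn spawn(&mut self, actor: ActorId) {
        self.running.insert(actor);
    }

    /// Records a "barrier collected" report from an actor.
    fn report_collected(&mut self, actor: ActorId, epoch: Epoch) {
        self.collected.entry(epoch).or_default().insert(actor);
    }

    fn clear_states(&mut self) {
        self.collected.clear();
    }

    fn abort_all_actors(&mut self) {
        self.running.clear();
    }

    /// Number of reports still recorded in the barrier state.
    fn stale_reports(&self) -> usize {
        self.collected.values().map(|actors| actors.len()).sum()
    }
}

/// Old order: clean states, then abort. Reports arriving in between survive.
fn force_stop_clean_first(state: &mut LocalBarrierState, in_flight: &[(ActorId, Epoch)]) {
    state.clear_states();
    for &(actor, epoch) in in_flight {
        state.report_collected(actor, epoch);
    }
    state.abort_all_actors();
}

/// New order: abort (drop) actors first, then clean. Late reports are wiped.
fn force_stop_abort_first(state: &mut LocalBarrierState, in_flight: &[(ActorId, Epoch)]) {
    state.abort_all_actors();
    for &(actor, epoch) in in_flight {
        state.report_collected(actor, epoch);
    }
    state.clear_states();
}

/// Runs both orderings on the same input and returns the stale report counts.
fn demo() -> (usize, usize) {
    let in_flight = [(1, 5), (2, 5)];

    let mut old = LocalBarrierState::default();
    old.spawn(1);
    old.spawn(2);
    force_stop_clean_first(&mut old, &in_flight);

    let mut new = LocalBarrierState::default();
    new.spawn(1);
    new.spawn(2);
    force_stop_abort_first(&mut new, &in_flight);

    (old.stale_reports(), new.stale_reports())
}
```

In this model, cleaning first leaves the two late reports in the state after the stop completes, while aborting first leaves none. That matches the intent in the commit title of cleaning states only after the actors are dropped.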