roachprod/failure-injection: add initial framework for failure injection library #140548

Open: DarrylWong wants to merge 4 commits into master from fi-lib

@DarrylWong DarrylWong commented Feb 5, 2025

This PR adds the initial framework for the failure injection library within roachprod. It introduces the failures package, which defines the FailureMode interface. A FailureMode describes a failure that can be injected into a roachprod cluster, along with how to revert it. The PR also adds the first supported failure: iptables network partitions.

See individual commits for details.

Release note: none
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-46439
Informs: #138970


I have a WIP branch here with rough implementations of the CLI, roachtest refactoring, and disk stall failures if you're curious how that works. I wanted to keep this PR small, though, so that it stays reviewable and I can get feedback before I get too deep.

This helper will be used in the new failure injection
library.

@DarrylWong DarrylWong force-pushed the fi-lib branch 2 times, most recently from 8c62328 to 0029b12 Compare February 6, 2025 19:18
This commit adds the framework for the failure injection
library, as well as the first supported failure: iptables
network partitions.

This failure can be used on roachprod clusters to create
bidirectional and asymmetric network partitions between
node(s).
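
As a rough illustration of the difference between bidirectional and asymmetric partitions, the sketch below builds the iptables commands one node might run against a peer. The `-A`/`-D`, `INPUT`/`OUTPUT`, `-s`/`-d`, and `-j DROP` arguments are standard iptables usage, but the Go names (`PartitionType`, `InjectRules`, `RestoreRules`) and the choice of rules are assumptions for illustration, not the commands this commit actually issues.

```go
package main

import "fmt"

// PartitionType is a hypothetical enum for the kinds of partition
// mentioned in the commit message.
type PartitionType int

const (
	Bidirectional PartitionType = iota // drop traffic in both directions
	IncomingOnly                       // asymmetric: drop only traffic from the peer
	OutgoingOnly                       // asymmetric: drop only traffic to the peer
)

// rules builds iptables commands for the given action ("-A" to append
// a rule, "-D" to delete it) against a single peer IP.
func rules(action, peerIP string, t PartitionType) []string {
	var cmds []string
	if t == Bidirectional || t == IncomingOnly {
		cmds = append(cmds, fmt.Sprintf("iptables %s INPUT -s %s -j DROP", action, peerIP))
	}
	if t == Bidirectional || t == OutgoingOnly {
		cmds = append(cmds, fmt.Sprintf("iptables %s OUTPUT -d %s -j DROP", action, peerIP))
	}
	return cmds
}

// InjectRules returns the commands that create the partition.
func InjectRules(peerIP string, t PartitionType) []string {
	return rules("-A", peerIP, t)
}

// RestoreRules returns the commands that remove the same rules.
func RestoreRules(peerIP string, t PartitionType) []string {
	return rules("-D", peerIP, t)
}

func main() {
	for _, cmd := range InjectRules("10.0.0.2", Bidirectional) {
		fmt.Println(cmd)
	}
	for _, cmd := range RestoreRules("10.0.0.2", Bidirectional) {
		fmt.Println(cmd)
	}
}
```

Generating the restore commands from the same function as the inject commands means each appended rule has an exactly matching delete.
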
This registry will allow for future usage of the failure
injection library through the CLI and the failure injection
planner/controller.
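
One way to picture such a registry is a map from failure name to constructor, so that a CLI or planner can look failures up by name. The sketch below is only that: `Registry`, `Register`, `Get`, and `Names` are hypothetical names, and the one-method `FailureMode` here is a simplified stand-in rather than the interface from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// FailureMode is a simplified stand-in for the real interface.
type FailureMode interface {
	Description() string
}

// Registry maps failure names to constructors (hypothetical design).
type Registry struct {
	ctors map[string]func() FailureMode
}

func NewRegistry() *Registry {
	return &Registry{ctors: make(map[string]func() FailureMode)}
}

// Register adds a constructor under name and rejects duplicates.
func (r *Registry) Register(name string, ctor func() FailureMode) error {
	if _, ok := r.ctors[name]; ok {
		return fmt.Errorf("failure %q already registered", name)
	}
	r.ctors[name] = ctor
	return nil
}

// Get constructs a new instance of the named failure.
func (r *Registry) Get(name string) (FailureMode, error) {
	ctor, ok := r.ctors[name]
	if !ok {
		return nil, fmt.Errorf("unknown failure %q", name)
	}
	return ctor(), nil
}

// Names lists registered failures in sorted order, e.g. for CLI help.
func (r *Registry) Names() []string {
	names := make([]string, 0, len(r.ctors))
	for name := range r.ctors {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// demoFailure is a placeholder failure used only in this example.
type demoFailure struct{}

func (demoFailure) Description() string { return "demo failure" }

func main() {
	reg := NewRegistry()
	if err := reg.Register("demo", func() FailureMode { return demoFailure{} }); err != nil {
		fmt.Println("register failed:", err)
		return
	}
	fm, err := reg.Get("demo")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(reg.Names(), fm.Description())
}
```
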
This adds an integration test for the failure injection library.
The test spins up a cluster and randomly selects a failure to
inject. It then validates that the failure was correctly injected.
Afterwards, it reverts the failure and validates that the failure
was correctly cleaned up.
@DarrylWong DarrylWong marked this pull request as ready for review February 7, 2025 17:50
@DarrylWong DarrylWong requested a review from a team as a code owner February 7, 2025 17:50
@DarrylWong DarrylWong requested review from herkolategan and srosenberg and removed request for a team February 7, 2025 17:50