
Nebula resiliency #5335

Open
porscheme opened this issue Feb 13, 2023 · 4 comments
Labels: affects/none, severity/none, type/bug

Comments

@porscheme

  • Cluster config
    Metad: 3
    Storaged: 3
    Graphd: 5
    Each storage node has 2 × 2 TB NVMe SSD disks

  • Space Config
    VID: String (Length 20)
    Partition Number: 200
    Replica Factor: 3
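
For reference, a space with that configuration roughly corresponds to a CREATE SPACE statement like the sketch below (the space name is just a placeholder):

```ngql
-- Placeholder space name; partition/replica/VID settings match the config above.
CREATE SPACE IF NOT EXISTS my_space (
    partition_num = 200,
    replica_factor = 3,
    vid_type = FIXED_STRING(20)
);
```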

Our cluster is running in Azure. We enabled auto patching and upgrades (Kubernetes upgrades). Oftentimes manual intervention is required when a VM gets stuck during an upgrade.

  • When the above happens, we cannot query the Nebula cluster. Is this expected? Has anyone seen this?
  • When the VM comes back up, we sometimes lose the data on the PVC. In this scenario, does Nebula recover the data from the other replicas?
@porscheme added the type/bug label on Feb 13, 2023
@github-actions bot added the affects/none and severity/none labels on Feb 13, 2023
@MegaByte875

@porscheme I have some questions about your scenario:

  • How are the graphd, metad, and storaged pods distributed across the Kubernetes nodes?
  • Did you set a PodDisruptionBudget (PDB) to guarantee service availability?
  • Do you use NVMe SSD disks for storaged?
  • Did the partition leaders exist on the storage node before the upgrade?

@porscheme (Author) commented Feb 14, 2023

Thanks @MegaByte875 for the reply.
The cluster version we are using is v3.3.0.

@porscheme I have some questions about your scenario:

  • How are the graphd, metad, and storaged pods distributed across the Kubernetes nodes?

[Porsche] Each component has its own separate K8s node pool and subnet. We are using the Azure Standard_L16as_v3 SKU. Auto-scale is disabled.

  • Did you set a PodDisruptionBudget (PDB) to guarantee service availability?

[Porsche] No, we are not using a PDB, since the official Nebula docs don't mention it. Should we use a PDB? Can you point me to any Nebula docs?
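
A minimal PodDisruptionBudget for the storaged pods could look roughly like the sketch below. The name and selector labels are placeholders and must match whatever labels your storaged pods actually carry:

```yaml
# Sketch of a PDB that allows at most one storaged pod to be evicted at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nebula-storaged-pdb        # placeholder name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      # Placeholder label -- replace with the labels on your storaged pods.
      app.kubernetes.io/component: storaged
```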

  • Do you use NVMe SSD disks for storaged?

[Porsche] Yes, the Azure Standard_L16as_v3 SKU comes with two 2 TB NVMe SSD disks attached to the VM.

  • Did the partition leaders exist on the storage node before the upgrade?

[Porsche] Yes, before the upgrade the partition leaders did exist on the storage node, and they were balanced with the other storage nodes. But during the upgrade, the leader count on that storage node dropped to zero (per SHOW HOSTS).
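
Once the node is back and registered, the leader distribution can be checked and, if needed, rebalanced. A rough sketch for NebulaGraph 3.x (verify the exact syntax against the docs for your version):

```ngql
-- Check per-host leader counts and partition distribution.
SHOW HOSTS;

-- Rebalance the Raft leaders across the storage hosts (submitted as a job in 3.x).
SUBMIT JOB BALANCE LEADER;
```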

@porscheme (Author)

Azure patched our K8s cluster today and the Nebula cluster was down.
Nebula's reliability metrics went down!

  • While AKS was patching, as expected, we saw a standby VM being brought up.
  • Not sure if Nebula recognized this new VM?
  • I wonder how Nebula installations worldwide handle this situation?
  • What options are available?
  • How do TigerGraph and Neo4j handle these situations? Does anyone know?

@MegaByte875

Here is an implementation plan that I hope will help you:

  • Add new buffer nodes to the cluster that run the specified Kubernetes version.
  • Cordon and drain one of the old nodes to minimize interruptions to running applications (see the sketch after this list).
    1. Limit the number of Pods that can be evicted at one time using a PDB, to control the scale of unavailable nodes.
    2. Use a ValidatingAdmissionWebhook so that the cluster performs pre-offline cleanup and preparation work before the Pod receives a deletion request.
      1. The operator controller watches for Pod change events.
      2. The operator controller starts to synchronize object states and attempts to delete the Pods that need to be evicted.
      3. The kube-apiserver calls the operator webhook interface.
      4. The webhook server asks the cluster to perform the pre-offline preparation work (the request is idempotent) and checks whether the preparation is complete. If it is complete, deletion is allowed; if not, deletion is denied.
      5. The process loops back to step 2 via the operator's control loop.
  • When the old nodes are completely drained, their VM image is reset to upgrade them to the new version, and they become buffer nodes for the next node to upgrade.
  • This process repeats until all nodes in the cluster are upgraded.
  • At the end of this process, the buffer nodes used for the upgrade are deleted to maintain the current number of nodes and regional balance. @porscheme
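
For concreteness, here is a minimal sketch of the cordon/drain step using plain kubectl, assuming a PDB like the one sketched earlier already limits storaged evictions; <old-node> is a placeholder node name:

```bash
# Stop new pods from being scheduled onto the node that is about to be upgraded.
kubectl cordon <old-node>

# Evict the pods. Drain uses the eviction API, so it respects the PDB,
# and the deletion requests still pass through any validating webhook.
kubectl drain <old-node> --ignore-daemonsets --delete-emptydir-data --timeout=10m

# Only if the same node is reused after the upgrade instead of being replaced:
kubectl uncordon <old-node>
```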
