
Nebula resiliency #5335

Open
porscheme opened this issue Feb 13, 2023 · 4 comments
Labels: affects/none, severity/none, type/bug

Comments

@porscheme

  • Cluster config
    Metad: 3
    Storaged: 3
    Graphd: 5
    Each storage node has 2 × 2 TB NVMe SSD disks

  • Space Config
    VID: String (Length 20)
    Partition Number: 200
    Replica Factor: 3
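
For reference, a space with that configuration roughly corresponds to a CREATE SPACE statement like the sketch below (the space name is just a placeholder):

```ngql
-- Placeholder space name; partition/replica/VID settings match the config above.
CREATE SPACE IF NOT EXISTS my_space (
    partition_num = 200,
    replica_factor = 3,
    vid_type = FIXED_STRING(20)
);
```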

Our cluster is running in Azure. We enabled auto patching and upgrades (Kubernetes upgrades). Oftentimes manual intervention is required when a VM gets stuck during an upgrade.

  • When the above happens, we cannot query the Nebula cluster. Is this expected? Has anyone seen this?
  • When the VM comes back up, we sometimes lose the data on the PVC. In this scenario, does Nebula recover the data from the other replicas?
@porscheme added the type/bug label on Feb 13, 2023
@github-actions bot added the affects/none and severity/none labels on Feb 13, 2023
@MegaByte875

@porscheme I have some questions about your scenario:

  • How are the graphd, metad, and storaged pods distributed across the Kubernetes nodes?
  • Did you set a PodDisruptionBudget (PDB) to guarantee service availability?
  • Do you use NVMe SSD disks for storaged?
  • Did the partition leaders exist on the storage node before the upgrade?

@porscheme (Author) commented Feb 14, 2023

Thanks @MegaByte875 for the reply.
The cluster version we are using is v3.3.0.

@porscheme I have some questions about your scenario:

  • How are the graphd, metad, and storaged pods distributed across the Kubernetes nodes?

[Porsche] Each component has its own separate K8s node pool and subnet. We are using the Azure Standard_L16as_v3 SKU. Auto-scale is disabled.

  • Did you set a PodDisruptionBudget (PDB) to guarantee service availability?

[Porsche] No, we are not using a PDB, since the official Nebula docs don't mention it. Should we use a PDB? Can you point me to any Nebula docs?
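
A minimal PodDisruptionBudget for the storaged pods could look roughly like the sketch below. The name and selector labels are placeholders and must match whatever labels your storaged pods actually carry:

```yaml
# Sketch of a PDB that allows at most one storaged pod to be evicted at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nebula-storaged-pdb        # placeholder name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      # Placeholder label -- replace with the labels on your storaged pods.
      app.kubernetes.io/component: storaged
```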

  • Do you use NVMe SSD disks for storaged?

[Porsche] Yes, the Azure Standard_L16as_v3 SKU comes with two 2 TB NVMe SSD disks attached to the VM.

  • Did the partition leaders exist on the storage node before the upgrade?

[Porsche] Yes, before the upgrade the partition leaders did exist on the storage node, and they were balanced with the other storage nodes. But during the upgrade, the leader count on that storage node dropped to zero (per SHOW HOSTS).
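
Once the node is back and registered, the leader distribution can be checked and, if needed, rebalanced. A rough sketch for NebulaGraph 3.x (verify the exact syntax against the docs for your version):

```ngql
-- Check per-host leader counts and partition distribution.
SHOW HOSTS;

-- Rebalance the Raft leaders across the storage hosts (submitted as a job in 3.x).
SUBMIT JOB BALANCE LEADER;
```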

@porscheme (Author)

Azure patched our K8s cluster today and the Nebula cluster was down.
Nebula's reliability metrics went down!

  • While AKS was patching, as expected, we saw a standby VM being brought up.
  • Not sure if Nebula recognized this new VM?
  • I wonder how Nebula installations worldwide handle this situation?
  • What options are available?
  • How do TigerGraph and Neo4j handle these situations? Does anyone know?

@MegaByte875

Here is an implementation plan that I hope will help you:

  • Add new buffer nodes to the cluster that run the specified Kubernetes version.
  • Cordon and drain one of the old nodes to minimize interruptions to running applications (see the sketch after this list).
    1. Limit the number of Pods that can be evicted at one time using a PDB, to control the scale of unavailable nodes.
    2. Use a ValidatingAdmissionWebhook so that the cluster performs pre-offline cleanup and preparation work before the Pod receives a deletion request.
      1. The operator controller watches for Pod change events.
      2. The operator controller starts to synchronize object states and attempts to delete the Pods that need to be evicted.
      3. The kube-apiserver calls the operator webhook interface.
      4. The webhook server asks the cluster to perform the pre-offline preparation work (the request is idempotent) and checks whether the preparation is complete. If it is complete, deletion is allowed; if not, deletion is denied.
      5. The process loops back to step 2 via the operator's control loop.
  • When the old nodes are completely drained, their VM image is reset to upgrade them to the new version, and they become buffer nodes for the next node to upgrade.
  • This process repeats until all nodes in the cluster are upgraded.
  • At the end of this process, the buffer nodes used for the upgrade are deleted to maintain the current number of nodes and regional balance. @porscheme
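
For concreteness, here is a minimal sketch of the cordon/drain step using plain kubectl, assuming a PDB like the one sketched earlier already limits storaged evictions; <old-node> is a placeholder node name:

```bash
# Stop new pods from being scheduled onto the node that is about to be upgraded.
kubectl cordon <old-node>

# Evict the pods. Drain uses the eviction API, so it respects the PDB,
# and the deletion requests still pass through any validating webhook.
kubectl drain <old-node> --ignore-daemonsets --delete-emptydir-data --timeout=10m

# Only if the same node is reused after the upgrade instead of being replaced:
kubectl uncordon <old-node>
```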
