docs: document procedure to migrate a live cluster to three-data-hall redundancy (#2191)
gm42 authored Jan 28, 2025
1 parent 984fc7f commit 6a7ccc3
Showing 3 changed files with 26 additions and 2 deletions.
4 changes: 4 additions & 0 deletions config/tests/three_data_hall/Readme.md
@@ -29,3 +29,7 @@ This will remove all created resources:
```bash
kubectl delete -f ./config/tests/three_data_hall/cluster.yaml
```

## Migration to a Three-Data-Hall cluster

See [Fault domains](../../../docs/manual/fault_domains.md)
22 changes: 21 additions & 1 deletion docs/manual/fault_domains.md
@@ -188,7 +188,27 @@ spec:

Once all three `FoundationDBCluster` resources are marked as reconciled, the FoundationDB cluster is up and running.
You can run this configuration in the same namespace, different namespaces or even across multiple different Kubernetes clusters.
Operations across the different `FoundationDBCluster` resources are [coordinated](#coordinating-global-operations).

### Migrating an existing cluster to Three-Data-Hall Replication

NOTE: These steps are for the `split` image setup, which is still the default; for the `unified` image setup the migration path is simpler.

It is possible to gracefully migrate a cluster running with `single`, `double`, or `triple` replication to Three-Data-Hall replication by following these steps:

1. Create 3 exact clones of the original Kubernetes `FoundationDBCluster` object (thus still with the same replication) and change these fields: `metadata.name`, `spec.processGroupIDPrefix`, `spec.dataHall`, `spec.dataCenter`.
2. Make sure that `skip: true` is set in their YAML definition so that the operator will not attempt to configure them.
3. Make sure that `seedConnectionString` is set to the current connection string of the original cluster.
4. Run `kubectl create` for each of them.
5. Set the configured state and connection string in their status subresource by using: `kubectl patch fdb my-new-cluster-a --type=merge --subresource status --patch "status: {configured: true, connectionString: \"...\" }"`; use the original cluster's connection string here.
6. Set `skip: false` on each of the 3 new clusters, then wait for reconciliation to finish.
7. Start a lengthy exclude procedure that excludes all processes of the original cluster; suggested order: `log`, `storage`, `coordinator`, `stateless`.
8. Delete the original cluster once all exclusions are complete.
9. Set `redundancyMode` to `three_data_hall` for the 3 new FoundationDB clusters, one after another.
10. Patch the seed connection string of 2 of the 3 new clusters to point to the third one, e.g. if you created clusters A, B, and C, set the seed connection strings of B and C to point to A and make sure that A has no seed connection string; this step is not strictly required but is often helpful in practice.
11. Scale down the clusters to use 1/3 of the original resources (each of them needs only 3 coordinators and 1/3 of the resources used for the other process classes).
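As an illustrative sketch of steps 1 through 3, a clone manifest might look like the following; the cluster name, prefix, data hall, and connection string values are hypothetical, and all other fields should be copied unchanged from the original cluster:

```yaml
# Hypothetical clone of the original cluster for data hall "az1".
# Keep the original replication mode at this stage of the migration.
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: my-cluster-az1            # step 1: changed from the original metadata.name
spec:
  skip: true                      # step 2: the operator must not reconcile this yet
  seedConnectionString: "..."     # step 3: current connection string of the original cluster
  processGroupIDPrefix: az1       # step 1: unique per clone
  dataHall: az1                   # step 1: unique per clone
  dataCenter: dc1
  # ...remaining fields copied verbatim from the original cluster spec
```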

This procedure mitigates temporary issues (`1031` timeouts) that may occur under sustained traffic when the data distributor and master are reallocated while the redundancy mode is being changed and/or data is being moved.
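The create/patch steps above (4 through 6) can be sketched as follows; the cluster names and the `"..."` connection string placeholder are hypothetical and must be replaced with your own values:

```bash
# Step 4: create the three clones (still with skip: true).
kubectl create -f my-cluster-az1.yaml -f my-cluster-az2.yaml -f my-cluster-az3.yaml

# Step 5: mark each clone as configured in its status subresource,
# using the original cluster's connection string.
for c in my-cluster-az1 my-cluster-az2 my-cluster-az3; do
  kubectl patch fdb "$c" --type=merge --subresource status \
    --patch 'status: {configured: true, connectionString: "..."}'
done

# Step 6: let the operator start reconciling the clones.
for c in my-cluster-az1 my-cluster-az2 my-cluster-az3; do
  kubectl patch fdb "$c" --type=merge --patch 'spec: {skip: false}'
done
```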

## Multi-Region Replication

2 changes: 1 addition & 1 deletion docs/manual/operations.md
@@ -10,7 +10,7 @@ The following localities will be set by the operator:
- `--locality_machineid`: The value depends on the fault domain key. For `foundationdb.org/none` it will be the Pod's name; in all other cases it will be the name of the node on which the Pod is running.
- `--locality_zoneid`: The value depends on the fault domain key. For `foundationdb.org/none` it will be the Pod's name; otherwise it defaults to the name of the node where the Pod is running. If `ValueFrom` is defined in the fault domain, that value will be used. If `foundationdb.org/kubernetes-cluster` is specified as the fault domain key, the predefined `value` will be used.
- `--locality_dcid`: This value will be set to the value defined in `cluster.Spec.DataCenter`; if that value is not set, the locality will not be set. This locality is used for FoundationDB deployments across multiple data centers/Kubernetes clusters.
- `--locality_data_hall`: This value will be set to the value defined in `cluster.Spec.DataHall`, if this value is not set the locality will not be set. Currently this locality doesn't have any affect, but will be used in the future for `three_data_hall` replication.
- `--locality_data_hall`: This value will be set to the value defined in `cluster.Spec.DataHall`, if this value is not set the locality will not be set. It is used for `three_data_hall` replication.
- `--locality_dns_name`: This value will only be set if `cluster.Spec.Routing.DefineDNSLocalityFields` is set to true. The value will be set to the `FDB_DNS_NAME` environment variable, which is set by the operator.
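To illustrate the list above, a process launched by the operator might carry locality flags like the following; all values here are hypothetical:

```bash
# Hypothetical example: locality flags for a process in data center "dc1",
# data hall "az1", running on node "node-a" with the default fault domain key.
/usr/bin/fdbserver \
  --locality_instance_id=storage-1 \
  --locality_machineid=node-a \
  --locality_zoneid=node-a \
  --locality_dcid=dc1 \
  --locality_data_hall=az1
```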

The operator uses the `locality_instance_id` to identify the process from the [machine-readable status](https://apple.github.io/foundationdb/mr-status.html) and match it to the according process group managed by the operator.
