The server flag should be skipped or present an error during a cluster-reset #3178

Closed
rancher-max opened this issue Jul 22, 2022 · 2 comments

@rancher-max

Environmental Info:
RKE2 Version:

All v1.22, v1.23, and v1.24

Node(s) CPU architecture, OS, and Version:

Any

Cluster Configuration:

3 servers and 1 agent; cluster-reset was run on 1 server that was NOT the initial server

Describe the bug:

When cluster-reset is run on a server node that was not the initial bootstrap server, and that therefore has the server flag set in config.yaml, the reset never completes. Instead, it loops indefinitely with:

INFO[0080] Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: connection refused" 
INFO[0080] Waiting for etcd server to become available  
{"level":"warn","ts":"2022-07-22T21:48:23.893Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0015a8700/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0080] Failed to get apiserver address from etcd: context deadline exceeded 
ERRO[0081] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:38846->127.0.0.1:6444: read: connection reset by peer 
ERRO[0083] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:38858->127.0.0.1:6444: read: connection reset by peer 

If I remove the server flag from the config, the reset is successful.
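
A minimal sketch of that workaround, assuming the flag is set as a top-level `server:` key in the config file (by default `/etc/rancher/rke2/config.yaml`). The helper name `comment_out_server_key` is hypothetical and not part of rke2; it does not handle a `--server` flag passed on the command line or set in a drop-in config file.

```shell
#!/bin/sh
# Hypothetical helper, not part of rke2: comment out a top-level `server:` key
# in the given config file, keeping a backup copy next to it.
comment_out_server_key() {
    config_file="$1"
    # Keep the original so the join configuration can be restored later.
    cp "$config_file" "$config_file.bak" || return 1
    # Prefix any line that starts with `server:` with `# `.
    sed 's/^server:/# server:/' "$config_file.bak" > "$config_file"
}

# Example usage (run as root; assumes the default config path):
#   comment_out_server_key /etc/rancher/rke2/config.yaml
#   systemctl stop rke2-server
#   rke2 server --cluster-reset
```

Commenting the key out rather than deleting it, and keeping a `.bak` copy, makes it easy to see what the node was originally configured to join.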

Steps To Reproduce:

  • Install rke2 and join 2 other server nodes
  • Run rke2-killall.sh on the initial server node and one other server node
  • Attempt cluster-reset on the remaining server node:
$ sudo systemctl stop rke2-server
$ sudo rke2 server --cluster-reset

Expected behavior:

The command should complete successfully.

Actual behavior:

The command loops forever with the logs shown above.

Additional context / logs:

N/A

@brandond

brandond commented Jul 22, 2022

This should probably be handled as an error. It does not make sense to allow a server with the --server flag set to also run --cluster-reset; having --server set explicitly asks the node to join an existing cluster. The user should remove the --server flag from the configuration if they want to disable the joining behavior.

Ignoring it is also an option, but I think the user should need to explicitly remove the server configuration.
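
The "handle it as an error" option can be illustrated from the outside with a small pre-flight check. This is only a sketch of the proposed behavior, not how rke2 implements it: the function name `cluster_reset_allowed` is hypothetical, it only inspects a top-level `server:` key in a single config file, and the message text is modeled on the output reported later in this thread.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of rke2: refuse a cluster-reset
# while the config file still tells the node to join an existing cluster.
cluster_reset_allowed() {
    config_file="$1"
    # An uncommented top-level `server:` key means the node is set up to join.
    if grep -q '^server:' "$config_file" 2>/dev/null; then
        echo "cannot perform cluster-reset while server URL is set" >&2
        return 1
    fi
    return 0
}

# Example usage (assumes the default config path):
#   cluster_reset_allowed /etc/rancher/rke2/config.yaml && rke2 server --cluster-reset
```

Failing fast here means the user has to remove the server configuration explicitly, rather than having it silently ignored.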

@fmoral2

fmoral2 commented Oct 11, 2023

Validated on Version:

$ rke2 -v
rke2 version v1.28.2+dev.f8cb4092 (f8cb409287ac09608d3d2fbac364dddb05400a3e)

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
NAME="SLES"
VERSION="15-SP4"

Cluster Configuration:
3 server nodes

Steps to validate the fix

  1. Install rke2 and join 2 other server nodes
  2. Run rke2-killall.sh on the initial server node and one other server node
  3. Attempt cluster-reset on the remaining server node:
$ sudo systemctl stop rke2-server
$ sudo rke2 server --cluster-reset

Reproduction Issue:

$ rke2 -v
rke2 version v1.28.2+dev.9e798492 (9e79849266360b1aa27fabe0d981bcc6bc069858)
go version go1.20.8 X:boringcrypto

$ sudo systemctl stop rke2-server

$ sudo rke2 server --cluster-reset

{"level":"warn","ts":"2023-10-06T16:44:02.887582Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000798fc0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0053] Failed to get apiserver address from etcd: context deadline exceeded 
ERRO[0053] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:58348->127.0.0.1:6444: read: connection reset by peer 
ERRO[0055] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:58352->127.0.0.1:6444: read: connection reset by peer 
ERRO[0057] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:58356->127.0.0.1:6444: read: connection reset by peer 
{"level":"warn","ts":"2023-10-06T16:44:07.890189Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000799180/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0058] Failed to get apiserver address from etcd: context deadline exceeded 
ERRO[0059] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:58360->127.0.0.1:6444: read: connection reset by peer 
ERRO[0061] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:58364->127.0.0.1:6444: read: connection reset by peer 
{"level":"warn","ts":"2023-10-06T16:44:12.480244Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d7c00/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"level":"info","ts":"2023-10-06T16:44:12.480423Z","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
{"level":"warn","ts":"2023-10-06T16:44:12.482186Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d7c00/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
INFO[0063] Failed to test data store connection: context deadline exceeded 
INFO[0063] Waiting for etcd server to become available  
{"level":"warn","ts":"2023-10-06T16:44:12.890366Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000855880/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0063] Failed to get apiserver address from etcd: context deadline exceeded 
ERRO[0063] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:58368->127.0.0.1:6444: read: connection reset by peer 
ERRO[0065] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:58372->127.0.0.1:6444: read: connection reset by peer 
ERRO[0067] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:58376->127.0.0.1:6444: read: connection reset by peer 
{"level":"warn","ts":"2023-10-06T16:44:17.89164Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000e34000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
WARN[0068] Failed to get apiserver address from etcd: context deadline exceeded 
ERRO[0069] failed to get CA certs: Get "https://127.0.0.1:6444/cacerts": read tcp 127.0.0.1:58382->127.0.0.1:6444: read: connection reset by peer 


Validation Results:


$ rke2 -v 
rke2 version v1.28.2+dev.f8cb4092 (f8cb409287ac09608d3d2fbac364dddb05400a3e)

$ sudo systemctl  stop rke2-server

$ sudo rke2 server --cluster-reset

FATA[0000] cannot perform cluster-reset while server URL is set - remove server from configuration before resetting 

