
Etcd loses track of leases #14019

Closed
cjbottaro opened this issue May 8, 2022 · 4 comments

@cjbottaro

What happened?

Etcd thinks these leases still exist...

%{
   header: %{
     cluster_id: 8855822249941472443,
     member_id: 16980875726527512478,
     raft_term: 252,
     revision: 16834726
   },
   leases: [
     %{ID: 3492964341253159182},
     %{ID: 4584242825964808783},
     %{ID: 362681125277541271},
     %{ID: 3492964341253159049},
     %{ID: 362681125277541277},
     %{ID: 362681125277541327},
     %{ID: 362681125277541281},
     %{ID: 362681125277541309},
     %{ID: 3492964341253159261},
     %{ID: 362681125277541235},
     %{ID: 362681125277541359},
     %{ID: 362681125277541357},
     %{ID: 362681125277541263},
     %{ID: 362681125277541267},
     %{ID: 362681125277541319},
     %{ID: 3492964341253159252},
     %{ID: 4584242825964808733},
...

Each lease has a TTL of 5s. We shut down our system (so no new leases were being created), and these leases still existed over 24h later.

Trying to revoke any of these leases results in an error saying they don't exist.

:eetcd_lease.revoke(Etcd, 3492964341253159182)
{:error,
 {:grpc_error,
  %{"grpc-message": "etcdserver: requested lease not found", "grpc-status": 5}}}

We had to shut down our system because asking Etcd for a lease would result in a timeout. After our system had been down for a few hours, Etcd stopped timing out when we asked for a lease.

What did you expect to happen?

Etcd should continue to grant leases under heavy load and not lose track of them.

How can we reproduce it (as minimally and precisely as possible)?

Lock and unlock unique locks at a rate of about 80 per second, each using a lease with TTL=5s, and let that run for a day or so (a sketch of that workload follows).
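A minimal load-generator sketch of that workload, assuming eetcd exposes a lock API along the lines of :eetcd_lock.lock/3 and :eetcd_lock.unlock/2 (the grant call mirrors the revoke call shown above; verify all names and arities against your eetcd version):

# Hedged repro sketch: ~80 lock/unlock cycles per second, 5s leases.
# :eetcd_lease.grant/2 and the :eetcd_lock functions are assumptions;
# check them against the eetcd version you run.
defmodule LeaseChurn do
  def run(i \\ 0) do
    {:ok, %{ID: lease_id}} = :eetcd_lease.grant(Etcd, 5)
    {:ok, %{key: key}} = :eetcd_lock.lock(Etcd, "lock-#{i}", lease_id)
    {:ok, _} = :eetcd_lock.unlock(Etcd, key)
    Process.sleep(12)  # ~80 iterations per second, ignoring call latency
    run(i + 1)
  end
end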

Anything else we need to know?

We use Etcd for distributed locking and nothing else. Each lease TTL is 5s. We have a 3-node cluster running on a single Core i3-8100 machine. We are only requesting about 80 locks per second.

--auto-compaction-retention=5m

This setting seems reasonable given that our locks have a TTL of 5s.

Etcd version (please run commands below)

quay.io/coreos/etcd:v3.5.3

Etcd configuration (command line flags or environment variables)

--auto-compaction-retention=5m

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ kubectl exec etcd-0-6776dd499d-k92mn -- etcdctl member list -w table
+------------------+---------+--------+--------------------+--------------------+------------+
|        ID        | STATUS  |  NAME  |     PEER ADDRS     |    CLIENT ADDRS    | IS LEARNER |
+------------------+---------+--------+--------------------+--------------------+------------+
| 554ff50439e23079 | started | etcd-0 | http://etcd-0:2380 | http://etcd-0:2379 |      false |
| 5ba2a9fe8d840508 | started | etcd-2 | http://etcd-2:2380 | http://etcd-2:2379 |      false |
| eba8308136b9bf9e | started | etcd-1 | http://etcd-1:2380 | http://etcd-1:2379 |      false |
+------------------+---------+--------+--------------------+--------------------+------------+

Relevant log output

No response

@ahrtr (Member) commented May 8, 2022

You are probably running into #13205. Could you try to reproduce this issue using the latest code on the release-3.5 or main branch?

Please also provide the following info so that others can take a closer look.

  1. The complete etcd command-line configuration;
  2. The detailed steps to reproduce this issue.

@ahrtr (Member) commented Sep 8, 2022

@cjbottaro any update on this? Have you enabled auth in the cluster?

@cjbottaro (Author)

Ahh, I stopped using Etcd for locks and went back to using single-node Redis for now. I'll come back to Etcd if the need for HA ever arises, though.

stale bot commented Dec 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Dec 31, 2022
stale bot closed this as completed Apr 2, 2023