Error updating the cluster architecture #587

Closed
ajzach opened this issue Jul 18, 2024 · 25 comments

ajzach commented Jul 18, 2024

Hello, I am having an issue with one of our clusters (AWS Elasticache) that has autoscaling configured. The cluster has a minimum of 10 and a maximum of 15 nodes. During the day, the cluster scaled up to 13 nodes, and the client correctly detected the new nodes. However, when it scaled back down to 10, we started seeing errors: the client kept trying to resolve the domain names of the nodes that had been removed. I tested adding the 11th node back, and the client detected it, but once I removed it again, the errors reappeared. It seems the client is not updating the cluster architecture internally. We are using version 1.0.35.

rueian (Collaborator) commented Jul 19, 2024

Hi @ajzach, did the errors disappear eventually?

ajzach (Author) commented Jul 20, 2024

> Hi @ajzach, did the errors disappear eventually?

Hello @rueian, no. We saw the errors for hours; they only disappeared when I added the nodes back to the cache.

proost (Contributor) commented Jul 20, 2024

I also experienced a similar issue. When I scale down the nodes in a cluster, the client doesn't update correctly. I'm currently digging into it.

rueian (Collaborator) commented Jul 20, 2024

Hi @ajzach,

I have just tested AWS Elasticache by deleting shards manually, but I didn't see the DNS error you described and rueidis refreshed the cluster topology successfully by issuing the CLUSTER SHARDS command to the configuration endpoint provided by AWS.

Additionally, I found that AWS Elasticache responds to CLUSTER SHARDS with IP addresses instead of domain names.

> the client was trying to resolve the domain of the nodes that had been removed, causing errors.

So this looks weird to me. Did you manually put the domain names of nodes into the rueidis.ClientOption.InitAddress? In the case of AWS Elasticache, the InitAddress should contain only the configuration endpoint.
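For reference, a minimal setup for an Elasticache cluster should look roughly like this (the endpoint hostname below is only a placeholder):

```go
package main

import "github.com/redis/rueidis"

func main() {
	// Put only the Elasticache configuration endpoint here, never individual node hostnames.
	client, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress: []string{"my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com:6379"}, // placeholder
	})
	if err != nil {
		panic(err)
	}
	defer client.Close()
	_ = client // issue commands through client.Do(...) as usual
}
```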

ajzach (Author) commented Jul 20, 2024

To connect to Redis, I just use the endpoint. I performed the same tests adding and removing nodes, including failover, and everything worked correctly. This error is not very common; currently, we have more than 100 applications using rueidis, and only 2 have reported this problem. It seems that the error occurs after more prolonged use of the client: some condition causes the client to stop updating the cluster architecture internally and to keep nodes in memory that were already deleted.

rueian (Collaborator) commented Jul 21, 2024

Hi @ajzach,

Rueidis sends CLUSTER SHARDS to the configuration endpoint to get the latest cluster topology whenever an error occurs or a Redis MOVED response is received. Based on your description, the mechanism was still working since those domain name errors disappeared when you added new nodes. So I think it was more likely that your configuration endpoint kept giving you stale topology until a new node was added.

The current rueidis always gets the latest cluster topology from the configuration endpoint only, but that seems not to work well with your Elasticache cluster. Would you like to try the new v1.0.42-alpha? The new version sends CLUSTER SHARDS to a randomly chosen known node instead, which I think will reduce the chance of getting stale information.
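If you want to check what topology the configuration endpoint itself is handing out when the errors happen, a rough probe like the following can help (a sketch only, assuming the generated ClusterShards command builder; ForceSingleClient skips rueidis's own cluster discovery so the query goes to whatever node the endpoint resolves to, and the endpoint hostname is a placeholder):

```go
package main

import (
	"context"
	"log"

	"github.com/redis/rueidis"
)

func main() {
	// Connect to the configuration endpoint as a plain single-node client.
	probe, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress:       []string{"my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com:6379"}, // placeholder
		ForceSingleClient: true, // no cluster discovery; talk only to the resolved node
	})
	if err != nil {
		log.Fatal(err)
	}
	defer probe.Close()

	// Ask that node for its current view of the topology.
	shards, err := probe.Do(context.Background(), probe.B().ClusterShards().Build()).ToAny()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("CLUSTER SHARDS: %v", shards)
}
```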

proost (Contributor) commented Jul 21, 2024

It is hard to reproduce the situation. I agree with rueian; I guess the configuration endpoint returns stale cluster information. I scale clusters in/out and up/down sometimes, but this is the first time the cluster topology was not updated.

rueian (Collaborator) commented Jul 21, 2024

Thanks @proost! Would you also like to try the v1.0.42-alpha? And how old is your Elasticache cluster? I found that a newly created Elasticache forms the cluster with IP addresses instead of domain names, so it is not even possible for me to get a domain name resolution error.

proost (Contributor) commented Jul 21, 2024

@rueian

> how old is your Elasticache cluster

Do you mean the Redis version or how long it has been running?

rueian (Collaborator) commented Jul 21, 2024

> how old is your Elasticache cluster
>
> do you mean version of redis or operating time?

Maybe both. I think it is possible that clusters differ even on the same Redis version if they were created on different dates.

proost (Contributor) commented Jul 22, 2024

@rueian
I have run the Redis cluster for about 13 months. About 4 weeks ago, I scaled down the cluster.

I use Redis version 7.0.7.

ajzach (Author) commented Jul 22, 2024

> Hi @ajzach,
>
> Rueidis sends CLUSTER SHARDS to the configuration endpoint to get the latest cluster topology whenever an error occurs or a Redis MOVED response is received. Based on your description, the mechanism was still working since those domain name errors disappeared when you added new nodes. So I think it was more likely that your configuration endpoint kept giving you stale topology until a new node was added.
>
> The current rueidis always gets the latest cluster topology from the configuration endpoint only, but that seems not to work well with your Elasticache cluster. Would you like to try the new v1.0.42-alpha? The new version sends CLUSTER SHARDS to a randomly chosen known node instead, which I think will reduce the chance of getting stale information.

While I was experiencing the errors, I accessed the application instance and queried the endpoint directly; the nodes that had been removed were not listed.

rueian (Collaborator) commented Jul 22, 2024

Hi @ajzach, that was probably because at that time the configuration endpoint resolved to a relatively new node while rueidis kept an old connection to an old node.

rueian (Collaborator) commented Jul 23, 2024

Hi @ajzach, the v1.0.42-alpha should reduce the chance of getting stale information from an old connection. Please let me know if you have tried it.

ajzach (Author) commented Jul 23, 2024

I remember that the special handling of the AWS configuration endpoint was added for a reason. What are the implications of removing it?

rueian (Collaborator) commented Jul 23, 2024

According to an AWS Redis team member, the configuration endpoint is essentially a DNS alias for all the nodes. The special handling was added before I knew that fact, under the wrong assumption that the endpoint was a special program responsible for the cluster topology. So the specialization is actually meaningless.
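You can verify this yourself: resolving the configuration endpoint returns the addresses of cluster nodes rather than a single special host. A quick check (the hostname is a placeholder):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// The Elasticache configuration endpoint is just a DNS name pointing at cluster nodes.
	ips, err := net.LookupHost("my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com") // placeholder
	if err != nil {
		panic(err)
	}
	for _, ip := range ips {
		fmt.Println(ip)
	}
}
```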

dntam00 (Contributor) commented Aug 23, 2024

Hi @rueian,
Yesterday I tested Redis cluster failover in AWS. The topology is 1 shard with a master node + 2 replicas. There were 2 cases:

  • success
  • failed

In the failed case, I observed strange behavior based on what I captured:

  • before failover: the client sent requests to a replica node. I think it should not be like this, because the client is not allowed to read/write on a replica node; it should receive a MOVED response.
  • after failover: the client received MOVED responses because the previous master node had become a replica, but no CLUSTER SHARDS command was sent to Redis after that.

I think rueidis refreshes connections each time it receives a MOVED response, but I did not see that happening (checked with tcpdump). I'm using rueidis v1.0.37. The client kept receiving MOVED from the replica node without sending CLUSTER SHARDS to Redis; this went on for about 10 hours and was still continuing, so to make the client work again I had to restart it.

Is it related to this patch https://github.com/redis/rueidis/releases/tag/v1.0.42-alpha?
Thank you very much.

rueian (Collaborator) commented Aug 23, 2024

Hi @dangngoctam00,

That looks very weird. Have you ever seen the CLUSTER SHARDS command being sent after initialization?

dntam00 (Contributor) commented Aug 23, 2024

Hi @rueian, I haven't checked that in the failed case; maybe I need to test it again. But based on the tcpdump capture, there was no CLUSTER SHARDS command in the 3 minutes leading up to the failover.
In the success case, CLUSTER SHARDS is sent after the failover process.

proost (Contributor) commented Aug 23, 2024

@dangngoctam00 could you update the rueidis version to 1.0.42 or above? 1.0.42 includes the 1.0.42-alpha commits and also makes the rueidis client switch to the target connection immediately when a MOVED response is received, before the topology refresh completes.

Separately from the version bump, not sending the CLUSTER SHARDS command at all is very weird.
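Conceptually (this is just an illustration, not the actual rueidis code), the 1.0.42 change means the client can pick the target of a MOVED reply right away and refresh the full topology in the background:

```go
package main

import (
	"fmt"
	"strings"
)

// nodeFromMoved extracts the target address from a MOVED error, which looks like
// "MOVED 3999 10.0.1.25:6379". Illustration only, not the actual rueidis code:
// the idea is to retry on that node immediately and refresh the topology
// (CLUSTER SHARDS) asynchronously instead of waiting for the refresh.
func nodeFromMoved(reply string) string {
	parts := strings.Fields(reply)
	return parts[len(parts)-1]
}

func main() {
	fmt.Println(nodeFromMoved("MOVED 3999 10.0.1.25:6379")) // 10.0.1.25:6379
}
```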

proost (Contributor) commented Aug 23, 2024

@dangngoctam00 could you try again after bumping up the rueidis version?

dntam00 (Contributor) commented Aug 23, 2024

Hi @proost, I will upgrade to version 1.0.43 and try again next Monday; I'll reply in this thread.
Anyway, I still want to understand why the client got stuck and kept receiving MOVED responses, but I couldn't reproduce it locally with an AWS alias host.
Thank you!

dntam00 (Contributor) commented Aug 24, 2024

Hi @rueian, @proost,
There may be blocking in the _refresh function of the cluster client at the line

result = <-results

When there are multiple connections and only 1 AWS endpoint, the getClusterSlots function is executed only once and there is no message in the results channel. As a result, the line

c.ch = nil

is never executed, so the subsequent LazyDo calls are ignored.
Could you review it? Thank you.
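Not the actual rueidis code, but a minimal illustration of that kind of hang: if fewer results are ever sent on the channel than the collector tries to receive, the receive blocks forever and any cleanup after it (like c.ch = nil) never runs:

```go
package main

import "fmt"

func main() {
	// Two connections that both point at the single configuration endpoint.
	addrs := []string{"cfg-endpoint:6379", "cfg-endpoint:6379"}
	results := make(chan string)

	launched := map[string]bool{}
	for _, a := range addrs {
		if launched[a] {
			continue // deduplicated: only one topology fetch for the single endpoint
		}
		launched[a] = true
		go func(addr string) { results <- "topology from " + addr }(a)
	}

	// The collector expects one result per connection, so the second receive
	// blocks forever (Go's runtime reports the deadlock here; in a real program
	// with other goroutines running, it simply hangs, as described above).
	for range addrs {
		fmt.Println(<-results)
	}
}
```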

rueian (Collaborator) commented Aug 25, 2024

Hi @dangngoctam00,

Thank you for looking into the details. If that was the case then versions after v1.0.42 should have fixed it.

I also dropped v1.0.45-alpha.2, which adds a timeout on getClusterSlots to avoid it getting stuck. Could you give it a try?
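The guard is roughly the following pattern (a sketch, not the exact rueidis implementation): bound the wait on the results channel so a missing reply can no longer wedge the refresh.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForTopology waits for a topology result but gives up after a timeout,
// so a reply that never arrives cannot block the refresh forever.
func waitForTopology(results <-chan string, timeout time.Duration) (string, error) {
	select {
	case r := <-results:
		return r, nil
	case <-time.After(timeout):
		return "", errors.New("cluster topology refresh timed out")
	}
}

func main() {
	results := make(chan string) // nothing is ever sent, as in the stuck case above
	_, err := waitForTopology(results, 2*time.Second)
	fmt.Println(err) // cluster topology refresh timed out
}
```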

dntam00 (Contributor) commented Aug 25, 2024

Hi @rueian, I've tried v1.0.45-alpha.2 and the client refreshes the Redis cluster topology normally.
FYI: while testing, I also found this bug: [BUG] CLUSTER SHARDS command returns "empty array" in slots section, so I also think it's good to send CLUSTER SHARDS randomly to cluster nodes like you said before.
Thank you very much.

rueian closed this as completed Aug 26, 2024