Error updating the cluster architecture #587

Closed
ajzach opened this issue Jul 18, 2024 · 25 comments

ajzach commented Jul 18, 2024

Hello, I am having an issue with one of our clusters (AWS Elasticache) that has autoscaling configured. The cluster has a minimum of 10 and a maximum of 15 nodes. During the day, the cluster scaled up to 13 nodes, and the client correctly detected the new nodes. However, when it scaled back down to 10, we started seeing errors: the client kept trying to resolve the domain names of the nodes that had been removed. I tested adding the 11th node back, and the client detected it, but once I removed it again, the errors reappeared. It seems the client is not updating the cluster architecture internally. We are using version 1.0.35.

rueian (Collaborator) commented Jul 19, 2024

Hi @ajzach, did the errors disappear eventually?

ajzach (Author) commented Jul 20, 2024

> Hi @ajzach, did the errors disappear eventually?

Hello @rueian, no. We saw the errors for hours; they only disappeared when I added the nodes back to the cache.

proost (Contributor) commented Jul 20, 2024

I also experienced a similar issue. When I scale down the nodes in a cluster, the client doesn't update correctly. I'm currently digging into it.

rueian (Collaborator) commented Jul 20, 2024

Hi @ajzach,

I have just tested AWS Elasticache by deleting shards manually, but I didn't see the DNS error you described and rueidis refreshed the cluster topology successfully by issuing the CLUSTER SHARDS command to the configuration endpoint provided by AWS.

Additionally, I found that AWS Elasticache responds to CLUSTER SHARDS with IP addresses instead of domain names.

> the client was trying to resolve the domain of the nodes that had been removed, causing errors.

So this looks weird to me. Did you manually put the domain names of nodes into the rueidis.ClientOption.InitAddress? In the case of AWS Elasticache, the InitAddress should contain only the configuration endpoint.
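For reference, a minimal setup for an Elasticache cluster should look roughly like this (the endpoint hostname below is only a placeholder):

```go
package main

import "github.com/redis/rueidis"

func main() {
	// Put only the Elasticache configuration endpoint here, never individual node hostnames.
	client, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress: []string{"my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com:6379"}, // placeholder
	})
	if err != nil {
		panic(err)
	}
	defer client.Close()
	_ = client // issue commands through client.Do(...) as usual
}
```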

ajzach (Author) commented Jul 20, 2024

To connect to Redis, I just use the endpoint. I performed the same tests adding and removing nodes, including failover, and everything worked correctly. This error is not very common; currently, we have more than 100 applications using rueidis, and only 2 have reported this problem. It seems that the error occurs after more prolonged use of the client: some condition causes the client to stop updating the cluster architecture internally and to keep nodes in memory that were already deleted.

rueian (Collaborator) commented Jul 21, 2024

Hi @ajzach,

Rueidis sends CLUSTER SHARDS to the configuration endpoint to get the latest cluster topology whenever an error occurs or a Redis MOVED response is received. Based on your description, the mechanism was still working since those domain name errors disappeared when you added new nodes. So I think it was more likely that your configuration endpoint kept giving you stale topology until a new node was added.

The current rueidis always gets the latest cluster topology from the configuration endpoint only, but that seems not to work well with your Elasticache cluster. Would you like to try the new v1.0.42-alpha? The new version sends CLUSTER SHARDS to a randomly chosen known node instead, which I think will reduce the chance of getting stale information.
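If you want to check what topology the configuration endpoint itself is handing out when the errors happen, a rough probe like the following can help (a sketch only, assuming the generated ClusterShards command builder; ForceSingleClient skips rueidis's own cluster discovery so the query goes to whatever node the endpoint resolves to, and the endpoint hostname is a placeholder):

```go
package main

import (
	"context"
	"log"

	"github.com/redis/rueidis"
)

func main() {
	// Connect to the configuration endpoint as a plain single-node client.
	probe, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress:       []string{"my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com:6379"}, // placeholder
		ForceSingleClient: true, // no cluster discovery; talk only to the resolved node
	})
	if err != nil {
		log.Fatal(err)
	}
	defer probe.Close()

	// Ask that node for its current view of the topology.
	shards, err := probe.Do(context.Background(), probe.B().ClusterShards().Build()).ToAny()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("CLUSTER SHARDS: %v", shards)
}
```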

proost (Contributor) commented Jul 21, 2024

It is hard to reproduce the situation. I agree with rueian; I guess the configuration endpoint returns stale cluster information. I scale clusters in/out and up/down sometimes, but this is the first time the cluster topology was not updated.

rueian (Collaborator) commented Jul 21, 2024

Thanks @proost! Would you also like to try the v1.0.42-alpha? And how old is your Elasticache cluster? I found that a newly created Elasticache forms the cluster with IP addresses instead of domain names, so it is not even possible for me to get a domain name resolution error.

proost (Contributor) commented Jul 21, 2024

@rueian

> how old is your Elasticache cluster

Do you mean the Redis version or how long it has been running?

rueian (Collaborator) commented Jul 21, 2024

> how old is your Elasticache cluster
>
> do you mean version of redis or operating time?

Maybe both. I think it is possible that clusters differ even on the same Redis version if they were created on different dates.

proost (Contributor) commented Jul 22, 2024

@rueian
I have run the Redis cluster for about 13 months. About 4 weeks ago, I scaled down the cluster.

I use Redis version 7.0.7.

ajzach (Author) commented Jul 22, 2024

> Hi @ajzach,
>
> Rueidis sends CLUSTER SHARDS to the configuration endpoint to get the latest cluster topology whenever an error occurs or a Redis MOVED response is received. Based on your description, the mechanism was still working since those domain name errors disappeared when you added new nodes. So I think it was more likely that your configuration endpoint kept giving you stale topology until a new node was added.
>
> The current rueidis always gets the latest cluster topology from the configuration endpoint only, but that seems not to work well with your Elasticache cluster. Would you like to try the new v1.0.42-alpha? The new version sends CLUSTER SHARDS to a randomly chosen known node instead, which I think will reduce the chance of getting stale information.

While I was experiencing the errors, I accessed the application instance and queried the endpoint directly; the nodes that had been removed were not listed.

rueian (Collaborator) commented Jul 22, 2024

Hi @ajzach, that was probably because at that time the configuration endpoint resolved to a relatively new node while rueidis kept an old connection to an old node.

rueian (Collaborator) commented Jul 23, 2024

Hi @ajzach, the v1.0.42-alpha should reduce the chance of getting stale information from an old connection. Please let me know if you have tried it.

ajzach (Author) commented Jul 23, 2024

I remember that the special handling of the AWS configuration endpoint was added for a reason. What are the implications of removing it?

rueian (Collaborator) commented Jul 23, 2024

According to an AWS Redis team member, the configuration endpoint is essentially a DNS alias for all the nodes. The special handling was added before I knew that fact, under the wrong assumption that the endpoint was a special program responsible for the cluster topology. So the specialization is actually meaningless.
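You can verify this yourself: resolving the configuration endpoint returns the addresses of cluster nodes rather than a single special host. A quick check (the hostname is a placeholder):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// The Elasticache configuration endpoint is just a DNS name pointing at cluster nodes.
	ips, err := net.LookupHost("my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com") // placeholder
	if err != nil {
		panic(err)
	}
	for _, ip := range ips {
		fmt.Println(ip)
	}
}
```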

dntam00 (Contributor) commented Aug 23, 2024

Hi @rueian,
Yesterday I tested Redis cluster failover in AWS. The topology is 1 shard with a master node + 2 replicas. There were 2 cases:

  • success
  • failed

In the failed case, I observed strange behavior based on what I captured:

  • before failover: the client sent requests to a replica node. I think it should not be like this, because the client is not allowed to read/write on a replica node; it should receive a MOVED response.
  • after failover: the client received MOVED responses because the previous master node had become a replica, but no CLUSTER SHARDS command was sent to Redis after that.

I think rueidis refreshes connections each time it receives a MOVED response, but I did not see that happening (checked with tcpdump). I'm using rueidis v1.0.37. The client kept receiving MOVED from the replica node without sending CLUSTER SHARDS to Redis; this went on for about 10 hours and was still continuing, so to make the client work again I had to restart it.

Is it related to this patch https://github.com/redis/rueidis/releases/tag/v1.0.42-alpha?
Thank you very much.

rueian (Collaborator) commented Aug 23, 2024

Hi @dangngoctam00,

That looks very weird. Have you ever seen the CLUSTER SHARDS command being sent after initialization?

dntam00 (Contributor) commented Aug 23, 2024

Hi @rueian, I haven't checked that in the failed case; maybe I need to test it again. But based on the tcpdump capture, there was no CLUSTER SHARDS command in the 3 minutes leading up to the failover.
In the success case, CLUSTER SHARDS is sent after the failover process.

proost (Contributor) commented Aug 23, 2024

@dangngoctam00 could you update the rueidis version to 1.0.42 or above? 1.0.42 includes the 1.0.42-alpha commits and also makes the rueidis client switch to the target connection immediately when a MOVED response is received, before the topology refresh completes.

Separately from the version bump, not sending the CLUSTER SHARDS command at all is very weird.
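Conceptually (this is just an illustration, not the actual rueidis code), the 1.0.42 change means the client can pick the target of a MOVED reply right away and refresh the full topology in the background:

```go
package main

import (
	"fmt"
	"strings"
)

// nodeFromMoved extracts the target address from a MOVED error, which looks like
// "MOVED 3999 10.0.1.25:6379". Illustration only, not the actual rueidis code:
// the idea is to retry on that node immediately and refresh the topology
// (CLUSTER SHARDS) asynchronously instead of waiting for the refresh.
func nodeFromMoved(reply string) string {
	parts := strings.Fields(reply)
	return parts[len(parts)-1]
}

func main() {
	fmt.Println(nodeFromMoved("MOVED 3999 10.0.1.25:6379")) // 10.0.1.25:6379
}
```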

proost (Contributor) commented Aug 23, 2024

@dangngoctam00 could you try again after bumping up the rueidis version?

dntam00 (Contributor) commented Aug 23, 2024

Hi @proost, I will upgrade to version 1.0.43 and try again next Monday; I'll reply in this thread.
Anyway, I still want to understand why the client got stuck and kept receiving MOVED responses, but I couldn't reproduce it locally with an AWS alias host.
Thank you!

dntam00 (Contributor) commented Aug 24, 2024

Hi @rueian, @proost,
There may be blocking in the _refresh function of the cluster client at the line

result = <-results

When there are multiple connections and only 1 AWS endpoint, the getClusterSlots function is executed only once and there is no message in the results channel. As a result, the line

c.ch = nil

is never executed, so the subsequent LazyDo calls are ignored.
Could you review it? Thank you.
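Not the actual rueidis code, but a minimal illustration of that kind of hang: if fewer results are ever sent on the channel than the collector tries to receive, the receive blocks forever and any cleanup after it (like c.ch = nil) never runs:

```go
package main

import "fmt"

func main() {
	// Two connections that both point at the single configuration endpoint.
	addrs := []string{"cfg-endpoint:6379", "cfg-endpoint:6379"}
	results := make(chan string)

	launched := map[string]bool{}
	for _, a := range addrs {
		if launched[a] {
			continue // deduplicated: only one topology fetch for the single endpoint
		}
		launched[a] = true
		go func(addr string) { results <- "topology from " + addr }(a)
	}

	// The collector expects one result per connection, so the second receive
	// blocks forever (Go's runtime reports the deadlock here; in a real program
	// with other goroutines running, it simply hangs, as described above).
	for range addrs {
		fmt.Println(<-results)
	}
}
```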

rueian (Collaborator) commented Aug 25, 2024

Hi @dangngoctam00,

Thank you for looking into the details. If that was the case then versions after v1.0.42 should have fixed it.

I also dropped v1.0.45-alpha.2, which adds a timeout on getClusterSlots to avoid it getting stuck. Could you give it a try?
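The guard is roughly the following pattern (a sketch, not the exact rueidis implementation): bound the wait on the results channel so a missing reply can no longer wedge the refresh.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForTopology waits for a topology result but gives up after a timeout,
// so a reply that never arrives cannot block the refresh forever.
func waitForTopology(results <-chan string, timeout time.Duration) (string, error) {
	select {
	case r := <-results:
		return r, nil
	case <-time.After(timeout):
		return "", errors.New("cluster topology refresh timed out")
	}
}

func main() {
	results := make(chan string) // nothing is ever sent, as in the stuck case above
	_, err := waitForTopology(results, 2*time.Second)
	fmt.Println(err) // cluster topology refresh timed out
}
```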

dntam00 (Contributor) commented Aug 25, 2024

Hi @rueian, I've tried v1.0.45-alpha.2 and the client refreshes the Redis cluster topology normally.
FYI: while testing, I also found this bug: [BUG] CLUSTER SHARDS command returns "empty array" in slots section, so I also think it's good to send CLUSTER SHARDS randomly to cluster nodes like you said before.
Thank you very much.

rueian closed this as completed Aug 26, 2024