adapt increasing gap in retry policy #3

Open
hcfw007 opened this issue Apr 26, 2022 · 7 comments
Labels
question Further information is requested

Comments

@hcfw007
Member

hcfw007 commented Apr 26, 2022

The current resolve policy will retry 3 times with a 15s gap, which is too short for the token to recover.
I suggest we add more retries with an increasing gap, e.g. 5s, 8s, 13s, 21s, 34s, like a Fibonacci sequence.
Since a token in an error state (e.g. restarting) currently looks the same to the discover service as an invalid token, we need to allow more time for the token to recover.
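A minimal sketch of such a retry helper, assuming a generic async `task` callback; the name `retryWithFibonacciGaps` and the default of 8 retries are illustrative, not the current resolver API:

```ts
// Sketch only: retry a task with Fibonacci-increasing gaps (5s, 8s, 13s, ...).
async function retryWithFibonacciGaps<T> (
  task: () => Promise<T>,
  maxRetries = 8,
): Promise<T> {
  let gap  = 5_000   // first gap: 5s
  let next = 8_000   // second gap: 8s
  let lastError: unknown

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task()
    } catch (e) {
      lastError = e
      if (attempt === maxRetries) {
        break
      }
      // wait, then advance the Fibonacci gap: 5s, 8s, 13s, 21s, 34s, ...
      await new Promise<void>(resolve => setTimeout(resolve, gap))
      const sum = gap + next
      gap  = next
      next = sum
    }
  }
  throw lastError
}
```

The gap only grows between attempts, so the first few retries still happen quickly while the later ones give the token more and more time to come back.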

@huan
Member

huan commented Apr 26, 2022

Could we reduce the token service recovery time, instead of increasing the token error retry timeout?

i.e. Is there any way to make the token service recover instantly when it has to be restarted?

@huan added the question (Further information is requested) label Apr 26, 2022
@hcfw007
Member Author

hcfw007 commented Apr 26, 2022

Actually, in most cases the token will recover in time, but there are exceptions. When we discussed this problem, @windmemory felt we should allow about 5 minutes of recovery time.
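For reference, with the Fibonacci gaps proposed above, the cumulative wait would be 5 + 8 + 13 + 21 + 34 = 81s after five retries, and extending the sequence to eight retries (... 55s, 89s, 144s) gives 369s, just over six minutes, which would cover a five-minute recovery window.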

@huan
Member

huan commented Apr 26, 2022

Actually, in most cases the token will recover in time.

Good to know that!

about 5 minutes of recovery time.

So we are talking about an edge case of the timeout, which I believe should be thrown as an error as soon as possible?

@hcfw007
Member Author

hcfw007 commented Apr 26, 2022

I'd like to introduce a typical scenario:
When we update the puppet on a server, we need to pull the image and recreate the container. In this process the token will be down for about 3-5 minutes, depending mostly on the network. It would be great if that did not cause a bot error.

@huan
Member

huan commented Apr 26, 2022

When we update the puppet on a server, we need to pull the image and recreate the container. In this process the token will be down for about 3-5 minutes, depending mostly on the network.

Thanks for sharing your use case!

I get a sense of déjà vu from this process ... @lijiarui

Can we start the updated container first, before we shut down the old one, so that we can switch to the new one right after we stop the old one?

@hcfw007
Member Author

hcfw007 commented Apr 27, 2022

It's really hard to decide how long after the new container starts we should wait before stopping the old one. The time to start a wecom client can be as short as 10s or as long as 60s. And if two wecom instances for one account run at the same time, there might be some weird problems.
Furthermore, we need to close the old wecom client so that the temp files do not change while the new client is starting. Sometimes a temp-file mismatch will cause the account to log out.

@hcfw007
Member Author

hcfw007 commented Apr 27, 2022

This also relates to the question: should we retry when we get a 404 response? I think we should at least retry for a short while, since, as I mentioned before, upgrading the server version will cause 404s from the discover service for some time.
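A sketch of what that bounded 404 retry could look like, assuming the discover service is reached over plain HTTP with the global fetch (Node 18+); `discoverWithGrace`, `discoverUrl`, and the 5-minute / 15s defaults are placeholders rather than the real API:

```ts
// Sketch only: treat a 404 from the discover service as potentially
// transient (e.g. the server is being upgraded) and keep retrying for a
// bounded window before surfacing the error.
async function discoverWithGrace (
  discoverUrl: string,
  graceMillis = 5 * 60 * 1000,   // allow ~5 minutes for the token to recover
  gapMillis   = 15 * 1000,       // 15s between attempts
): Promise<Response> {
  const deadline = Date.now() + graceMillis

  for (;;) {
    const response = await fetch(discoverUrl)
    if (response.status !== 404) {
      return response   // success, or a non-404 error for the caller to handle
    }
    if (Date.now() + gapMillis > deadline) {
      throw new Error('discover service still returns 404 after the grace period')
    }
    await new Promise<void>(resolve => setTimeout(resolve, gapMillis))
  }
}
```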
