adapt increasing gap in retry policy #3

Open
hcfw007 opened this issue Apr 26, 2022 · 7 comments
Labels
question Further information is requested

Comments

@hcfw007
Member

hcfw007 commented Apr 26, 2022

The current resolve policy will retry 3 times with a 15s gap, which is too short for the token to recover.
I suggest we add more retries with an increasing gap, e.g. 5s, 8s, 13s, 21s, 34s, like a Fibonacci sequence.
Since a token in an error state (e.g. restarting) currently looks the same to the discover service as an invalid token, we need to allow more time for the token to recover.
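A minimal sketch of such a retry helper, assuming a generic async `task` callback; the name `retryWithFibonacciGaps` and the default of 8 retries are illustrative, not the current resolver API:

```ts
// Sketch only: retry a task with Fibonacci-increasing gaps (5s, 8s, 13s, ...).
async function retryWithFibonacciGaps<T> (
  task: () => Promise<T>,
  maxRetries = 8,
): Promise<T> {
  let gap  = 5_000   // first gap: 5s
  let next = 8_000   // second gap: 8s
  let lastError: unknown

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task()
    } catch (e) {
      lastError = e
      if (attempt === maxRetries) {
        break
      }
      // wait, then advance the Fibonacci gap: 5s, 8s, 13s, 21s, 34s, ...
      await new Promise<void>(resolve => setTimeout(resolve, gap))
      const sum = gap + next
      gap  = next
      next = sum
    }
  }
  throw lastError
}
```

The gap only grows between attempts, so the first few retries still happen quickly while the later ones give the token more and more time to come back.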

@huan
Member

huan commented Apr 26, 2022

Could we reduce the token service recovery time, instead of increasing the token error retry timeout?

i.e. Is there any way to make the token service recover instantly when it has to be restarted?

@huan added the question (Further information is requested) label Apr 26, 2022
@hcfw007
Member Author

hcfw007 commented Apr 26, 2022

Actually, in most cases the token will recover in time, but there are exceptions. When we discussed this problem, @windmemory felt we should allow about 5 minutes of recovery time.
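For reference, with the Fibonacci gaps proposed above, the cumulative wait would be 5 + 8 + 13 + 21 + 34 = 81s after five retries, and extending the sequence to eight retries (... 55s, 89s, 144s) gives 369s, just over six minutes, which would cover a five-minute recovery window.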

@huan
Member

huan commented Apr 26, 2022

Actually, in most cases the token will recover in time.

Good to know that!

about 5 minutes of recovery time.

So we are talking about an edge case of the timeout, which I believe should be thrown as an error as soon as possible?

@hcfw007
Member Author

hcfw007 commented Apr 26, 2022

I'd like to introduce a typical scenario:
When we update the puppet on a server, we need to pull the image and recreate the container. In this process the token will be down for about 3-5 minutes, depending mostly on the network. It would be great if that did not cause a bot error.

@huan
Member

huan commented Apr 26, 2022

When we update the puppet on a server, we need to pull the image and recreate the container. In this process the token will be down for about 3-5 minutes, depending mostly on the network.

Thanks for sharing your use case!

I get a sense of déjà vu from this process ... @lijiarui

Can we start the updated container first, before we shut down the old one, so that we can switch to the new one right after we stop the old one?

@hcfw007
Member Author

hcfw007 commented Apr 27, 2022

It's really hard to decide how long after the new container starts we should wait before stopping the old one. The time to start a wecom client can be as short as 10s or as long as 60s. And if two wecom instances for one account run at the same time, there might be some weird problems.
Furthermore, we need to close the old wecom client so that the temp files do not change while the new client is starting. Sometimes a temp-file mismatch will cause the account to log out.

@hcfw007
Member Author

hcfw007 commented Apr 27, 2022

This also relates to the question: should we retry when we get a 404 response? I think we should at least retry for a short while, since, as I mentioned before, upgrading the server version will cause 404s from the discover service for some time.
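A sketch of what that bounded 404 retry could look like, assuming the discover service is reached over plain HTTP with the global fetch (Node 18+); `discoverWithGrace`, `discoverUrl`, and the 5-minute / 15s defaults are placeholders rather than the real API:

```ts
// Sketch only: treat a 404 from the discover service as potentially
// transient (e.g. the server is being upgraded) and keep retrying for a
// bounded window before surfacing the error.
async function discoverWithGrace (
  discoverUrl: string,
  graceMillis = 5 * 60 * 1000,   // allow ~5 minutes for the token to recover
  gapMillis   = 15 * 1000,       // 15s between attempts
): Promise<Response> {
  const deadline = Date.now() + graceMillis

  for (;;) {
    const response = await fetch(discoverUrl)
    if (response.status !== 404) {
      return response   // success, or a non-404 error for the caller to handle
    }
    if (Date.now() + gapMillis > deadline) {
      throw new Error('discover service still returns 404 after the grace period')
    }
    await new Promise<void>(resolve => setTimeout(resolve, gapMillis))
  }
}
```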
