context deadline exceeded in worker log when starting a workflow after upgrading to 1.19.0 #3826
Comments
Is this a newly created namespace? Did the problem persist after a few minutes?
Yes, this is a newly created namespace.
It did.
It does succeed eventually (after ~19 minutes), but that does seem weird. It looks like something odd is going on with history and ringpop (service discovery): note that the worker locates frontend, matching, and itself right at startup, but doesn't locate history for another 12 seconds. (The worker doesn't contact history directly to start a workflow, but it suggests that the frontend might also have a problem.) Then at 2023-01-19T13:46:27.123Z it seems to lose history completely and find it again 25 seconds later, and at the next retry the scanner workflow is created successfully. Looking at the frontend log itself, it either sees no history pods or has trouble reaching the one it can see, until it eventually succeeds. So I'd suspect something about the connectivity of the history pod.
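One quick way to sanity-check that suspicion is a plain TCP probe against the history pod from one of the other pods. This is only a sketch: the pod name and label selector are placeholders, netcat being available in the image is an assumption, and 7234/6934 are the default history gRPC and membership (ringpop) ports, which may differ in your deployment.

```sh
# Placeholders/assumptions: frontend pod name, history label selector, default ports.
HISTORY_IP=$(kubectl get pod -l app.kubernetes.io/component=history \
  -o jsonpath='{.items[0].status.podIP}')
kubectl exec -it <frontend-pod> -- nc -vz "$HISTORY_IP" 7234   # history gRPC port (default)
kubectl exec -it <frontend-pod> -- nc -vz "$HISTORY_IP" 6934   # history membership/ringpop port (default)
```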
We're seeing this issue too, with the same characteristics:
There are some differences, but they don't seem to be substantial:
Can you provide the logs from the history role? Looking at the frontend and worker logs, the worker failed to start because of an error in the frontend, and the frontend errors are related to the history role.
Here are all the logs for completeness. Using the timestamps in the logs, I started everything at pretty much the same time at 19:55, and then waited a few minutes.
temporal-frontend-logs.txt
I don't have any specific theories yet, just listing a few things I'm noticing in those logs:
My bad, that's just the worker crash-looping because the rest of the cluster is broken. But every run looks about the same.
That was a bit of laziness from when I created this Helm chart for deploying Temporal on k8s. I've tried it again with unique ports, but the issue still manifests in the same way.
These are old, but this is a case I run into often because I'm running this locally and I frequently bring down all of my pods and bring them back up quickly for local development. However, those stale entries are unreachable with a "connection refused" error (I just poked those host:ports with netcat from a different pod).
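For reference, a sketch of how one might inspect those stale entries; this assumes SQL persistence (PostgreSQL here) with a database and user both named temporal, and uses the cluster_membership table mentioned further down:

```sh
# Assumes PostgreSQL persistence and db/user "temporal"; adjust the client call for MySQL etc.
kubectl exec -it <postgres-pod> -- \
  psql -U temporal -d temporal \
  -c "SELECT * FROM cluster_membership;"   # stale rows point at host:ports that no longer exist
```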
WORKAROUND: I can't reproduce this if I wait 20s. And it seems like in the rare cases where I don't wait 20s before restarting but things still work, all of the pods came back with the same IPs as before. So it seems like something changed about service discovery/ringpop that broke my use case. Is there anything I can tweak to get the old behaviour back? Or is this something that can easily be tweaked from your end in the service discovery code? As far as I can tell, that …
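For anyone else leaning on this workaround, a minimal sketch of the restart flow with the 20-second wait built in; the Deployment names and the choice to scale to zero (rather than delete pods) are assumptions about the setup:

```sh
# Assumed Deployment names; replace with whatever your Helm chart actually creates.
SERVICES="temporal-frontend temporal-history temporal-matching temporal-worker"

kubectl scale deployment $SERVICES --replicas=0
sleep 20   # give ringpop/cluster_membership time to consider the old hosts gone
kubectl scale deployment $SERVICES --replicas=1
```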
I think that we're experiencing the same issue after upgrading to 1.19.1: |
Interesting. Can you reproduce the problem if you wait at least 20s between stopping all services and starting them again?
Yeah, I did manage to reproduce the issue after killing the temporal process (I have just the one) and restarting it in the same container. Edit: I did wait 20 seconds before restarting.
I tried to repro this issue by:
The issue I found is that the startup timeout is too short. I added a fix in #3911. Please let me know if it fixes your issue.
Some more info that was requested on Slack:
The first time I saw this issue was after upgrading. But I've since gone back and reproduced it with a brand-new cluster on 1.19.1. My earlier messages about cleaning the cluster_membership table are just me trying to work around the issue. Basically, if I restart my services enough and hit this issue where they are all stuck, I can bring them all down, clear the cluster_membership table, and bring them back up.
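The "clear the table" step would look roughly like this; again it is only a sketch assuming PostgreSQL persistence, and it should only be run while all Temporal services are stopped, since it wipes the membership state they rely on:

```sh
# Run only while every Temporal service is down; assumes PostgreSQL and db/user "temporal".
kubectl exec -it <postgres-pod> -- \
  psql -U temporal -d temporal \
  -c "DELETE FROM cluster_membership;"   # drop stale host entries before bringing services back up
```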
I built the server image from https://github.com/temporalio/docker-builds and tested it. The easiest way I have of reproducing it is just running … With the fix from @yux0, I haven't been able to reproduce the issue. In the 20 times I tried killing the pods and checking the cluster, it came back up successfully every time (verified with …).
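For the "verified with" part, a couple of generic checks one could use; the label selector and the tctl invocation via an admin-tools pod are assumptions, not necessarily what was actually run:

```sh
# Check that all pods are Running and that the frontend answers its gRPC health check.
kubectl get pods -l app.kubernetes.io/instance=temporal
kubectl exec -it <admintools-pod> -- tctl cluster health
```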
Expected Behavior
Successful workflow creation. We did not see this problem in 1.18.x.
Note that there was no data migration involved (this was a new deployment, including the database).
Actual Behavior
Errors like "context deadline exceeded" in the worker log when starting a workflow (see the attached logs).
Steps to Reproduce the Problem
It does not happen every time, but when it does, it seems to persist.
We do not have a simple way to recreate this, but we are hoping it might be a known issue given that we saw it immediately after going to 1.19.0. We can upgrade to 1.19.1 if this is a known issue fixed in that version.
Attaching Temporal logs:
kubectl-logs-temporal-frontend-c84645c8-wj8fr-temporal-frontend-current.txt
kubectl-logs-temporal-web-557c4795f9-46n76-temporal-web-current.txt
kubectl-logs-temporal-worker-77456bfc96-979hl-temporal-worker-current.txt
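(These look like plain kubectl logs captures; a sketch of collecting the same files, plus the history logs requested above, with pod and container names as placeholders following the naming pattern of the attached files:)

```sh
# Pod names are placeholders; container names follow the pattern in the attached file names.
kubectl logs <frontend-pod> -c temporal-frontend > kubectl-logs-temporal-frontend-current.txt
kubectl logs <worker-pod>   -c temporal-worker   > kubectl-logs-temporal-worker-current.txt
kubectl logs <history-pod>  -c temporal-history  > kubectl-logs-temporal-history-current.txt
```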
Specifications