-
Notifications
You must be signed in to change notification settings - Fork 14.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tasks Stuck at Scheduled State #40106
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval. |
I'm having the same issue. Airfow tasks stuck after a few days of running. I handled it by restarting scheduler pods. |
Could you please provide more details for reproducing the issue? (e.g., Airflow settings, what operators are being used in the tasks, versions of operators, etc.) |
I upgraded from 2.7.3 directly to 2.9.1 and this issue showed up. We did not catch it during testing because it only happened after running the scheduler for some time. Today I tested it more in our dev cluster by running more tasks, and the same issue showed up. We use KubernetesExecutor, and mostly Python and PythonVirtualEnvOperator. Meta database: PostgresDB 15.6 |
Assuming you haven't installed a different version of the k8s provider, you're running on 8.1.1. @hbc-acai @bangnh1 do either of you see pods stacking and not being deleted up in k8s? If you are exporting metrics to statsd/prometheus, what do you see for |
@jedcunningham Yes, I do see more than 10 more pods show in READY=0/1 and Status=complete states when I run Sorry, how can I export metrics to statsd/prometheus? Also, what is the correct why to check the k8s provider version? It is strange that I do not see a k8s provider in the package list for my scheduler:
|
It's this one, For metrics info, see https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html. I suspect you might be hitting #36998, and |
Thanks @jedcunningham. I upgraded |
As no further action seems required, I'll close this issue. |
We have the same issue. |
I think it must be some kind of race condition or missing error condition handling. I wonder if this one is involved with some cases where people manually kill tasks for example from the UI ? (this is a totally wild guess - I base it on the fact that it happens pretty often but not really always, so there might be some unusual thing to trigger it). |
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author. |
This issue has been closed because it has not received response from the issue author. |
We just upgraded from 2.10.2 to 2.10.4, and I think the issue is fixed. thanks |
Apache Airflow version
2.9.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
After upgrading to 2.9.1, we find tasks are stuck at scheduled state after about 1 hour scheduler started. During the first hour, all tasks are running fine. Then I restarted the scheduler, and it successfully moved the "stuck" task instances to queued state and then run them. But new tasks got stuck again after one hour.
This is reproduceable in my production cluster. It happens every time after we restart our scheduler. But we are not able to replicate this in our dev cluster.
There are no errors in the scheduler log. Here is some logs where things went wrong. I manually cleared one DAG with 3 tasks. 2 of the 3 tasks ran successufully, but task got stuck in the scheduled state. In the below log I only found information about the 2 tasks (database_setup, positions_extract ) that ran successfully.
What you think should happen instead?
No response
How to reproduce
I can easily reproduce it in my production cluster. But I cannot reproduce it in our dev cluster. Both clusters have almost exactly the same setup.
Operating System
Azure Kubernetes Service
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Using apache-airflow:2.9.1-python3.10 image
Anything else?
No response
Are you willing to submit PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: