
Tasks Stuck at Scheduled State #40106

Closed
1 of 2 tasks
hbc-acai opened this issue Jun 7, 2024 · 15 comments
Labels
area:core, area:providers, area:Scheduler (including HA (high availability) scheduler), kind:bug (This is a clearly a bug), pending-response, provider:cncf-kubernetes (Kubernetes provider related issues), stale (Stale PRs per the .github/workflows/stale.yml policy file)

Comments

@hbc-acai

hbc-acai commented Jun 7, 2024

Apache Airflow version

2.9.1

If "Other Airflow 2 version" selected, which one?

No response

What happened?

After upgrading to 2.9.1, we found that tasks get stuck in the scheduled state about an hour after the scheduler starts. During the first hour, all tasks run fine. I then restarted the scheduler, and it successfully moved the "stuck" task instances to the queued state and ran them, but new tasks got stuck again after another hour.

This is reproducible in my production cluster: it happens every time after we restart our scheduler. However, we are not able to replicate it in our dev cluster.

There are no errors in the scheduler log. Here are some logs from when things went wrong. I manually cleared one DAG with 3 tasks; 2 of the 3 tasks ran successfully, but the third got stuck in the scheduled state. In the log below I only found information about the 2 tasks (database_setup, positions_extract) that ran successfully.

[2024-06-07T01:12:52.113+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:13:52.199+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:14:52.284+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:417} INFO - 2 tasks up for execution:
        <TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
        <TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 0/16 running and queued tasks
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 1/16 running and queued tasks
[2024-06-07T01:15:37.977+0000] {scheduler_job_runner.py:596} INFO - Setting the following tasks to queued state:
        <TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
        <TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='database_setup', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.980+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'database_setup', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='positions_extract', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.981+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'positions_extract', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
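
For anyone hitting this, a quick way to confirm which task instances are stuck for a given run, and whether completed worker pods are piling up, is sketched below. These are standard Airflow 2.x CLI and kubectl commands; the DAG id and run id are simply the ones from the log above.

# list recent runs for the affected DAG
airflow dags list-runs -d update_risk_exposure_store
# per-task states for the run that has a task stuck in "scheduled"
airflow tasks states-for-dag-run update_risk_exposure_store scheduled__2024-05-28T10:10:00+00:00
# completed worker pods that have not been cleaned up
kubectl get pods -n airflow --field-selector=status.phase=Succeeded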

What do you think should happen instead?

No response

How to reproduce

I can easily reproduce it in my production cluster, but I cannot reproduce it in our dev cluster. Both clusters have almost exactly the same setup.

Operating System

Azure Kubernetes Service

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Using apache-airflow:2.9.1-python3.10 image

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@hbc-acai added the area:core, kind:bug, and needs-triage labels on Jun 7, 2024

boring-cyborg bot commented Jun 7, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@bangnh1

bangnh1 commented Jun 7, 2024

I'm having the same issue. Airflow tasks get stuck after a few days of running. I handled it by restarting the scheduler pods.

@shahar1 added the area:Scheduler and area:core labels and removed the area:core label on Jun 7, 2024
@shahar1
Contributor

shahar1 commented Jun 7, 2024

Could you please provide more details for reproducing the issue? (e.g., Airflow settings, what operators are being used in the tasks, versions of operators, etc.)
More details are needed so someone can reproduce and fix the issue themselves.

@hbc-acai
Author

hbc-acai commented Jun 7, 2024

I upgraded from 2.7.3 directly to 2.9.1, and this issue showed up. We did not catch it during testing because it only happens after the scheduler has been running for some time. Today I tested it more in our dev cluster by running more tasks, and the same issue appeared there as well.

We use the KubernetesExecutor, mostly with Python and the PythonVirtualenvOperator.

Metadata database: PostgreSQL 15.6
DAGs and logs persistent volumes: Azure Blob Storage
Single scheduler

@jedcunningham
Member

Assuming you haven't installed a different version of the k8s provider, you're running on 8.1.1.

@hbc-acai @bangnh1 do either of you see pods stacking up and not being deleted in k8s? If you are exporting metrics to statsd/prometheus, what do you see for executor.open_slots and executor.running_tasks during that time period?

@hbc-acai
Author

hbc-acai commented Jun 7, 2024

@jedcunningham Yes, I do see more than 10 pods showing READY=0/1 and STATUS=Completed when I run kubectl get pods -n airflow.

Sorry, how can I export metrics to statsd/prometheus?

Also, what is the correct way to check the k8s provider version? It is strange that I do not see a k8s provider in the package list for my scheduler:

airflow@airflow-scheduler-79b6db7fd8-x77mk:/opt/airflow$ pip freeze
adal==1.2.7
adlfs==2024.4.1
aiobotocore==2.12.3
aiofiles==23.2.1
aiohttp==3.9.5
aioitertools==0.11.0
aioodbc==0.4.1
aiosignal==1.3.1
alembic==1.13.1
amqp==5.2.0
annotated-types==0.6.0
anyio==4.3.0
apache-airflow==2.9.1
apache-airflow-providers-amazon==8.20.0
apache-airflow-providers-celery==3.6.2
apache-airflow-providers-cncf-kubernetes==8.1.1
apache-airflow-providers-common-io==1.3.1
apache-airflow-providers-common-sql==1.12.0
apache-airflow-providers-docker==3.10.0
apache-airflow-providers-elasticsearch==5.3.4
apache-airflow-providers-fab==1.0.4
apache-airflow-providers-ftp==3.8.0
apache-airflow-providers-google==10.17.0
apache-airflow-providers-grpc==3.4.1
apache-airflow-providers-hashicorp==3.6.4
apache-airflow-providers-http==4.10.1
apache-airflow-providers-imap==3.5.0
apache-airflow-providers-microsoft-azure==10.0.0
apache-airflow-providers-mysql==5.5.4
apache-airflow-providers-odbc==4.5.0
apache-airflow-providers-openlineage==1.7.0
apache-airflow-providers-postgres==5.10.2
apache-airflow-providers-redis==3.6.1
apache-airflow-providers-sendgrid==3.4.0
apache-airflow-providers-sftp==4.9.1
apache-airflow-providers-slack==8.6.2
apache-airflow-providers-smtp==1.6.1
apache-airflow-providers-snowflake==5.4.0
apache-airflow-providers-sqlite==3.7.1
apache-airflow-providers-ssh==3.10.1
apispec==6.6.1

@jedcunningham
Member

It's this one: apache-airflow-providers-cncf-kubernetes==8.1.1. Sorry, I tend to just refer to it as the "k8s provider", but it's actually named "CNCF Kubernetes".
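
A couple of generic ways to check the installed provider version, as a sketch (standard Airflow and pip commands, nothing specific to this deployment):

# provider packages and versions as Airflow sees them
airflow providers list
# or ask pip about just this package
pip show apache-airflow-providers-cncf-kubernetes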

For metrics info, see https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html.

I suspect you might be hitting #36998, and 8.3.0 (from #39551) might fix it. Can you try upgrading to that version?
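
To make both suggestions concrete, a minimal sketch, assuming the official image and environment-variable configuration (the StatsD host and port are placeholders for whatever collector you run):

# upgrade just the provider package
pip install "apache-airflow-providers-cncf-kubernetes==8.3.0"
# enable StatsD metrics so executor.open_slots and executor.running_tasks are emitted
export AIRFLOW__METRICS__STATSD_ON=True
export AIRFLOW__METRICS__STATSD_HOST=statsd-exporter.monitoring.svc   # placeholder
export AIRFLOW__METRICS__STATSD_PORT=9125                             # placeholder
export AIRFLOW__METRICS__STATSD_PREFIX=airflow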

@hbc-acai
Author

hbc-acai commented Jun 8, 2024

Thanks @jedcunningham. I upgraded to apache-airflow-providers-cncf-kubernetes==8.3.0 and I have not seen the same error in the last few hours.

@shahar1
Contributor

shahar1 commented Jun 8, 2024

As no further action seems required, I'll close this issue.
If you encounter this behavior again, please create another issue.

@shahar1 closed this as not planned (won't fix, can't repro, duplicate, stale) on Jun 8, 2024
@shahar1 added the duplicate label and removed the needs-triage label on Jun 8, 2024
@romsharon98
Contributor

romsharon98 commented Oct 21, 2024

We have the same issue.
I looked at executor_open_slots and saw that when we have a spike, this metric shows almost 0 open slots.
I wonder why that is, because we have 4 schedulers and AIRFLOW__CORE__PARALLELISM is set to 128.
Is there a different configuration for executor slots? And does "executor" also refer to the scheduler?
I also see in executor_running_tasks that we go up to 512 (4 * 128), so as I understand it, the executor_open_slots metric isn't correct and only shows one scheduler.
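
For context, a minimal sketch of the configuration involved here; the values mirror the ones mentioned above, and the per-scheduler interpretation is my assumption rather than something confirmed in this thread.

# Assumption: parallelism applies per scheduler, so each of the 4 schedulers has its
# own pool of 128 executor slots and reports its own executor_open_slots, which is
# roughly parallelism minus that scheduler's running and queued task instances.
export AIRFLOW__CORE__PARALLELISM=128
export AIRFLOW__METRICS__STATSD_ON=True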

@romsharon98 reopened this on Oct 21, 2024
@shahar1 added the area:providers and provider:cncf-kubernetes labels and removed the duplicate label on Oct 23, 2024
@potiuk
Member

potiuk commented Oct 31, 2024

I think it must be some kind of race condition or missing error-condition handling. I wonder if it is involved in cases where people manually kill tasks, for example from the UI? (This is a totally wild guess; I base it on the fact that it happens pretty often but not always, so there might be some unusual trigger.)

@potiuk
Member

potiuk commented Nov 5, 2024

Airflow 2.10.3 is now out and it includes fix #42932, which is likely to fix the problems you reported. Please upgrade, check whether it fixes your problem, and report back, @hbc-acai.


This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.

@github-actions bot added the stale label on Nov 27, 2024

github-actions bot commented Dec 4, 2024

This issue has been closed because it has not received a response from the issue author.

@github-actions bot closed this as not planned on Dec 4, 2024
@hbc-acai
Author

We just upgraded from 2.10.2 to 2.10.4, and I think the issue is fixed. Thanks!
