
Tasks Stuck at Scheduled State #40106

Closed
1 of 2 tasks
hbc-acai opened this issue Jun 7, 2024 · 15 comments
Labels
area:core, area:providers, area:Scheduler (including HA (high availability) scheduler), kind:bug (This is a clearly a bug), pending-response, provider:cncf-kubernetes (Kubernetes provider related issues), stale (Stale PRs per the .github/workflows/stale.yml policy file)

Comments

@hbc-acai

hbc-acai commented Jun 7, 2024

Apache Airflow version

2.9.1

If "Other Airflow 2 version" selected, which one?

No response

What happened?

After upgrading to 2.9.1, we found that tasks get stuck in the scheduled state about an hour after the scheduler starts. During the first hour, all tasks run fine. I then restarted the scheduler, and it successfully moved the "stuck" task instances to the queued state and ran them, but new tasks got stuck again after another hour.

This is reproducible in my production cluster: it happens every time after we restart our scheduler. However, we are not able to replicate it in our dev cluster.

There are no errors in the scheduler log. Here are some logs from when things went wrong. I manually cleared one DAG with 3 tasks; 2 of the 3 tasks ran successfully, but the third got stuck in the scheduled state. In the log below I only found information about the 2 tasks (database_setup, positions_extract) that ran successfully.

[2024-06-07T01:12:52.113+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:13:52.199+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:14:52.284+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:417} INFO - 2 tasks up for execution:
        <TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
        <TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 0/16 running and queued tasks
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 1/16 running and queued tasks
[2024-06-07T01:15:37.977+0000] {scheduler_job_runner.py:596} INFO - Setting the following tasks to queued state:
        <TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
        <TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='database_setup', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.980+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'database_setup', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='positions_extract', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.981+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'positions_extract', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
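
For anyone hitting this, a quick way to confirm which task instances are stuck for a given run, and whether completed worker pods are piling up, is sketched below. These are standard Airflow 2.x CLI and kubectl commands; the DAG id and run id are simply the ones from the log above.

# list recent runs for the affected DAG
airflow dags list-runs -d update_risk_exposure_store
# per-task states for the run that has a task stuck in "scheduled"
airflow tasks states-for-dag-run update_risk_exposure_store scheduled__2024-05-28T10:10:00+00:00
# completed worker pods that have not been cleaned up
kubectl get pods -n airflow --field-selector=status.phase=Succeeded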

What do you think should happen instead?

No response

How to reproduce

I can easily reproduce it in my production cluster, but I cannot reproduce it in our dev cluster. Both clusters have almost exactly the same setup.

Operating System

Azure Kubernetes Service

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Using apache-airflow:2.9.1-python3.10 image

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@hbc-acai added the area:core, kind:bug, and needs-triage labels on Jun 7, 2024

boring-cyborg bot commented Jun 7, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@bangnh1

bangnh1 commented Jun 7, 2024

I'm having the same issue. Airflow tasks get stuck after a few days of running. I handled it by restarting the scheduler pods.

@shahar1 added the area:Scheduler and area:core labels and removed the area:core label on Jun 7, 2024
@shahar1
Contributor

shahar1 commented Jun 7, 2024

Could you please provide more details for reproducing the issue? (e.g., Airflow settings, what operators are being used in the tasks, versions of operators, etc.)
More details are needed so someone can reproduce and fix the issue themselves.

@hbc-acai
Author

hbc-acai commented Jun 7, 2024

I upgraded from 2.7.3 directly to 2.9.1, and this issue showed up. We did not catch it during testing because it only happens after the scheduler has been running for some time. Today I tested it more in our dev cluster by running more tasks, and the same issue appeared there as well.

We use the KubernetesExecutor, mostly with Python and the PythonVirtualenvOperator.

Metadata database: PostgreSQL 15.6
DAGs and logs persistent volumes: Azure Blob Storage
Single scheduler

@jedcunningham
Member

Assuming you haven't installed a different version of the k8s provider, you're running on 8.1.1.

@hbc-acai @bangnh1 do either of you see pods stacking up and not being deleted in k8s? If you are exporting metrics to statsd/prometheus, what do you see for executor.open_slots and executor.running_tasks during that time period?

@hbc-acai
Author

hbc-acai commented Jun 7, 2024

@jedcunningham Yes, I do see more than 10 pods showing READY=0/1 and STATUS=Completed when I run kubectl get pods -n airflow.

Sorry, how can I export metrics to statsd/prometheus?

Also, what is the correct way to check the k8s provider version? It is strange that I do not see a k8s provider in the package list for my scheduler:

airflow@airflow-scheduler-79b6db7fd8-x77mk:/opt/airflow$ pip freeze
adal==1.2.7
adlfs==2024.4.1
aiobotocore==2.12.3
aiofiles==23.2.1
aiohttp==3.9.5
aioitertools==0.11.0
aioodbc==0.4.1
aiosignal==1.3.1
alembic==1.13.1
amqp==5.2.0
annotated-types==0.6.0
anyio==4.3.0
apache-airflow==2.9.1
apache-airflow-providers-amazon==8.20.0
apache-airflow-providers-celery==3.6.2
apache-airflow-providers-cncf-kubernetes==8.1.1
apache-airflow-providers-common-io==1.3.1
apache-airflow-providers-common-sql==1.12.0
apache-airflow-providers-docker==3.10.0
apache-airflow-providers-elasticsearch==5.3.4
apache-airflow-providers-fab==1.0.4
apache-airflow-providers-ftp==3.8.0
apache-airflow-providers-google==10.17.0
apache-airflow-providers-grpc==3.4.1
apache-airflow-providers-hashicorp==3.6.4
apache-airflow-providers-http==4.10.1
apache-airflow-providers-imap==3.5.0
apache-airflow-providers-microsoft-azure==10.0.0
apache-airflow-providers-mysql==5.5.4
apache-airflow-providers-odbc==4.5.0
apache-airflow-providers-openlineage==1.7.0
apache-airflow-providers-postgres==5.10.2
apache-airflow-providers-redis==3.6.1
apache-airflow-providers-sendgrid==3.4.0
apache-airflow-providers-sftp==4.9.1
apache-airflow-providers-slack==8.6.2
apache-airflow-providers-smtp==1.6.1
apache-airflow-providers-snowflake==5.4.0
apache-airflow-providers-sqlite==3.7.1
apache-airflow-providers-ssh==3.10.1
apispec==6.6.1

@jedcunningham
Member

It's this one: apache-airflow-providers-cncf-kubernetes==8.1.1. Sorry, I tend to just refer to it as the "k8s provider", but it's actually named "CNCF Kubernetes".
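
A couple of generic ways to check the installed provider version, as a sketch (standard Airflow and pip commands, nothing specific to this deployment):

# provider packages and versions as Airflow sees them
airflow providers list
# or ask pip about just this package
pip show apache-airflow-providers-cncf-kubernetes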

For metrics info, see https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html.

I suspect you might be hitting #36998, and 8.3.0 (from #39551) might fix it. Can you try upgrading to that version?
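
To make both suggestions concrete, a minimal sketch, assuming the official image and environment-variable configuration (the StatsD host and port are placeholders for whatever collector you run):

# upgrade just the provider package
pip install "apache-airflow-providers-cncf-kubernetes==8.3.0"
# enable StatsD metrics so executor.open_slots and executor.running_tasks are emitted
export AIRFLOW__METRICS__STATSD_ON=True
export AIRFLOW__METRICS__STATSD_HOST=statsd-exporter.monitoring.svc   # placeholder
export AIRFLOW__METRICS__STATSD_PORT=9125                             # placeholder
export AIRFLOW__METRICS__STATSD_PREFIX=airflow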

@hbc-acai
Author

hbc-acai commented Jun 8, 2024

Thanks @jedcunningham. I upgraded to apache-airflow-providers-cncf-kubernetes==8.3.0 and I have not seen the same error in the last few hours.

@shahar1
Contributor

shahar1 commented Jun 8, 2024

As no further action seems required, I'll close this issue.
If you encounter this behavior again, please create another issue.

@shahar1 closed this as not planned (won't fix, can't repro, duplicate, stale) on Jun 8, 2024
@shahar1 added the duplicate label and removed the needs-triage label on Jun 8, 2024
@romsharon98
Contributor

romsharon98 commented Oct 21, 2024

We have the same issue.
I looked at executor_open_slots and saw that when we have a spike, this metric shows almost 0 open slots.
I wonder why that is, because we have 4 schedulers and AIRFLOW__CORE__PARALLELISM is set to 128.
Is there a different configuration for executor slots? And does "executor" also refer to the scheduler?
I also see in executor_running_tasks that we go up to 512 (4 * 128), so as I understand it, the executor_open_slots metric isn't correct and only shows one scheduler.
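
For context, a minimal sketch of the configuration involved here; the values mirror the ones mentioned above, and the per-scheduler interpretation is my assumption rather than something confirmed in this thread.

# Assumption: parallelism applies per scheduler, so each of the 4 schedulers has its
# own pool of 128 executor slots and reports its own executor_open_slots, which is
# roughly parallelism minus that scheduler's running and queued task instances.
export AIRFLOW__CORE__PARALLELISM=128
export AIRFLOW__METRICS__STATSD_ON=True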

@romsharon98 reopened this on Oct 21, 2024
@shahar1 added the area:providers and provider:cncf-kubernetes labels and removed the duplicate label on Oct 23, 2024
@potiuk
Member

potiuk commented Oct 31, 2024

I think it must be some kind of race condition or missing error-condition handling. I wonder if it is involved in cases where people manually kill tasks, for example from the UI? (This is a totally wild guess; I base it on the fact that it happens pretty often but not always, so there might be some unusual trigger.)

@potiuk
Member

potiuk commented Nov 5, 2024

Airflow 2.10.3 is now out and it includes fix #42932, which is likely to fix the problems you reported. Please upgrade, check whether it fixes your problem, and report back, @hbc-acai.


This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.

@github-actions bot added the stale label on Nov 27, 2024

github-actions bot commented Dec 4, 2024

This issue has been closed because it has not received a response from the issue author.

@github-actions bot closed this as not planned on Dec 4, 2024
@hbc-acai
Author

We just upgraded from 2.10.2 to 2.10.4, and I think the issue is fixed. Thanks!
