Replies: 24 comments 10 replies
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.
-
We also found that sometimes tasks fail without any apparent reason and without retries, but the same task succeeds in the next run.
-
Your description is a bit vague, but I can say that the scheduler does not create empty slots in advance with the KubernetesExecutor. For each task being executed, one pod is launched by the scheduler. Are any tasks starting at all, or are all tasks failing? If some are running and some are not, is it sporadic, or do certain tasks always fail while others succeed? Is there any observable pattern?
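If it helps with diagnosing, here is a small sketch using the official `kubernetes` Python client to check whether any worker pods are being created at all. The namespace name and the `kubernetes_executor=True` label selector are assumptions based on a typical KubernetesExecutor deployment; adjust them to match your setup.

```python
# A diagnostic sketch, assuming the official `kubernetes` Python client is
# installed and the worker pods carry the usual KubernetesExecutor labels
# (the namespace and label selector below are assumptions - adjust as needed).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="airflow",                        # assumed namespace
    label_selector="kubernetes_executor=True",  # assumed label on worker pods
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```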
-
In my case, after some time the Airflow scheduler stops queuing tasks. We have empty slots in the pool and all components are running, yet no pods are spawned. All tasks are marked failed or skipped without any attempts. After restarting the scheduler pods, everything works fine again for a while. We only see this on an environment with 200+ DAGs.
-
Maybe it is the same issue: tasks run well for some time after a scheduler restart, but then a random task can fail without any logs or attempts. This started happening after switching from the CeleryExecutor to the KubernetesExecutor.
-
This sounds more like troubleshooting.
-
Why is this a discussion?
-
I described my issue: everything worked fine with the Airflow Celery workers, but not with the KubernetesExecutor. I also wrote that nothing appears in the scheduler logs.
-
Why do you think it changes anything from your perspective whether it is an issue or a discussion, @crabio? What's your expectation here? For me, a discussion means that if more details arrive and it is clearly a bug in Airflow that someone can reproduce, we might classify it as an issue in Airflow; if it is unclear, it stays a discussion. In both cases, it's really up to the author to provide enough evidence to allow people (who help others here in their free time) to help the author with their troubleshooting. In the case of an actual bug in Airflow, they might even be able to fix it for everyone. What is your actual expectation (for software that you get absolutely for free, without any guarantees)? Why do you think it matters to have it as an issue? What do you think will happen differently?
-
I'll try increasing the log level to DEBUG on all Airflow components and try to catch more details about the issue.
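For reference, a minimal sketch of the config knob involved, assuming the standard `AIRFLOW__<SECTION>__<KEY>` environment-variable override; in a real deployment this would be set as a container environment variable on each component (scheduler, webserver, workers) rather than from Python code.

```python
# A minimal sketch, assuming the standard AIRFLOW__<SECTION>__<KEY> environment
# variable override for airflow.cfg; in practice this is set on the containers
# before the Airflow processes start, not at runtime like this.
import os

os.environ["AIRFLOW__LOGGING__LOGGING_LEVEL"] = "DEBUG"
```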
-
I also found an error when scheduling DAGs with long names. One DAG has a long task name and, as a result, the Kubernetes API raises an error because the pod name is limited to 63 characters. Do you know of any fixes for this? It is inconvenient that I can't use normal DAG and task names longer than 63 characters with the KubernetesExecutor.
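This is not the provider's actual implementation, just a hypothetical sketch of the usual way around such limits: truncate the generated name and append a short hash so it stays unique and within 63 characters.

```python
# A hypothetical sketch (not the cncf.kubernetes provider's real code) of how a
# long DAG/task id could be folded into a Kubernetes-safe name: truncate and
# append a short hash so the result stays unique and within the 63-char limit.
import hashlib

MAX_LEN = 63  # the Kubernetes name limit mentioned above


def safe_pod_name(dag_id: str, task_id: str) -> str:
    base = f"{dag_id}-{task_id}".lower().replace("_", "-")
    if len(base) <= MAX_LEN:
        return base
    suffix = hashlib.sha256(base.encode()).hexdigest()[:8]
    return f"{base[:MAX_LEN - len(suffix) - 1]}-{suffix}"


print(safe_pod_name("some-very-long-dag-name-" * 3, "an-equally-long-task-name"))
```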
-
Mhm, I am not a K8s expert, but going by what I could find for that error: if this hypothesis is correct, then (1) you might need to check why the pods are restarted. Are they running out of resources and need larger reservations (e.g. is RAM on the K8s node running out?), or is there movement in the infrastructure (a node being drained while the workload is running)?
-
It seems like my issue is related to #13542.
-
I also have 2 schedulers, and each of them thinks it has different tasks currently running, but I definitely have no running tasks.
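One way to see what the schedulers actually believe is to look at the metadata database directly; here is a small sketch, assuming an environment where Airflow is installed and configured (the `queued_by_job_id` column shows which scheduler job queued each task instance).

```python
# A diagnostic sketch, assuming Airflow is installed and configured so the
# metadata DB is reachable; it lists the task instances currently recorded as
# queued/running and which scheduler job queued each of them.
from airflow import settings
from airflow.models import TaskInstance
from airflow.utils.state import State

session = settings.Session()
stuck = (
    session.query(TaskInstance)
    .filter(TaskInstance.state.in_([State.QUEUED, State.RUNNING]))
    .all()
)
for ti in stuck:
    print(ti.dag_id, ti.task_id, ti.run_id, ti.state, ti.queued_by_job_id)
session.close()
```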
-
I think I have found a race condition in the Airflow scheduler:
-
I found a working workaround! It seems like a race between multiple schedulers with the KubernetesExecutor.
-
I found an interesting insight: over 3 hours I have 830
-
I found the sequence of events for a good task and a stuck task in the scheduler. Good task execution debug:
Stuck task execution debug (applications_info):
-
@jscheffl
Solution:
What do you think?
-
I found a race condition between the 2 schedulers.
As a workaround, I'll use a single scheduler replica with the KubernetesExecutor.
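To double-check how many schedulers are actually competing, the scheduler heartbeats in the metadata DB can be inspected. A sketch, assuming a recent Airflow 2.x release where the job ORM model lives in `airflow.jobs.job` (older versions expose `BaseJob` instead):

```python
# A diagnostic sketch, assuming a recent Airflow 2.x release where the job ORM
# model is airflow.jobs.job.Job (older versions use airflow.jobs.base_job.BaseJob).
# It lists running scheduler jobs and their last heartbeat, which shows how many
# scheduler replicas are competing for the same work.
from airflow import settings
from airflow.jobs.job import Job

session = settings.Session()
schedulers = (
    session.query(Job)
    .filter(Job.job_type == "SchedulerJob", Job.state == "running")
    .all()
)
for job in schedulers:
    print(job.id, job.hostname, job.latest_heartbeat)
session.close()
```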
-
@crabio
-
Fixed in #36478
-
Apache Airflow version
2.7.2
What happened
After some analysis, it is confusing that the kubernetes provider v7.8.0 already has code to handle this:
But it is not clear why we still get the error.
I have also seen this error reported many times for other versions of Airflow.
What you think should happen instead
No response
How to reproduce
Operating System
Docker based on apache/airflow:2.7.2
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
apache-airflow-providers-cncf-kubernetes==7.8.0
Anything else
No response
Are you willing to submit PR?
Code of Conduct