Remove reference to undead tasks from documentation (#43536)

---------

Co-authored-by: Ryan Hatter <[email protected]>
karenbraganz and RNHTTR authored Jan 26, 2025
1 parent a14aedb commit 88a8eff
Showing 4 changed files with 41 additions and 47 deletions.
4 changes: 2 additions & 2 deletions airflow/jobs/scheduler_job_runner.py
@@ -2017,14 +2017,14 @@ def _purge_zombies(self, zombies: list[tuple[TI, str]], *, session: Session) ->
 f"Task did not emit heartbeat within time limit ({self._zombie_threshold_secs} "
 "seconds) and will be terminated. "
 "See https://airflow.apache.org/docs/apache-airflow/"
-"stable/core-concepts/tasks.html#zombie-undead-tasks"
+"stable/core-concepts/tasks.html#zombie-tasks"
 ),
 )
 )
 self.log.error(
 "Detected zombie job: %s "
 "(See https://airflow.apache.org/docs/apache-airflow/"
-"stable/core-concepts/tasks.html#zombie-undead-tasks)",
+"stable/core-concepts/tasks.html#zombie-tasks)",
 request,
 )
 self.job.executor.send_callback(request)
55 changes: 10 additions & 45 deletions docs/apache-airflow/core-concepts/tasks.rst
@@ -167,55 +167,20 @@ These can be useful if your code has extra knowledge about its environment and w

.. _concepts:zombies:

-Zombie/Undead Tasks
--------------------
+Zombie Tasks
+------------

-No system runs perfectly, and task instances are expected to die once in a while. Airflow detects two kinds of task/process mismatch:
+No system runs perfectly, and task instances are expected to die once in a while.

-* *Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive
-(e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these
-periodically, clean them up, and either fail or retry the task depending on its settings. Tasks can become zombies for
-many reasons, including:
+*Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive
+(e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these
+periodically, clean them up, and either fail or retry the task depending on its settings. Tasks can become zombies for
+many reasons, including:

-* The Airflow worker ran out of memory and was OOMKilled.
-* The Airflow worker failed its liveness probe, so the system (for example, Kubernetes) restarted the worker.
-* The system (for example, Kubernetes) scaled down and moved an Airflow worker from one node to another.
+* The Airflow worker ran out of memory and was OOMKilled.
+* The Airflow worker failed its liveness probe, so the system (for example, Kubernetes) restarted the worker.
+* The system (for example, Kubernetes) scaled down and moved an Airflow worker from one node to another.

-* *Undead tasks* are tasks that are *not* supposed to be running but are, often caused when you manually edit Task
-Instances via the UI. Airflow will find them periodically and terminate them.
-
-
-Below is the code snippet from the Airflow scheduler that runs periodically to detect zombie/undead tasks.
-
-.. exampleinclude:: /../../airflow/jobs/scheduler_job_runner.py
-:language: python
-:start-after: [START find_and_purge_zombies]
-:end-before: [END find_and_purge_zombies]
-
-
-The explanation of the criteria used in the above snippet to detect zombie tasks is as below:
-
-1. **Task Instance State**
-
-Only task instances in the RUNNING state are considered potential zombies.
-
-2. **Job State and Heartbeat Check**
-
-Zombie tasks are identified if the associated job is not in the RUNNING state or if the latest heartbeat of the job is
-earlier than the calculated time threshold (limit_dttm). The heartbeat is a mechanism to indicate that a task or job is
-still alive and running.
-
-3. **Job Type**
-
-The job associated with the task must be of type ``LocalTaskJob``.
-
-4. **Queued by Job ID**
-
-Only tasks queued by the same job that is currently being processed are considered.
-
-These conditions collectively help identify running tasks that may be zombies based on their state, associated job
-state, heartbeat status, job type, and the specific job that queued them. If a task meets these criteria, it is
-considered a potential zombie, and further actions, such as logging and sending a callback request, are taken.
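
For readers of this diff, the four criteria listed in the passage above (which this commit deletes from ``tasks.rst``) can be sketched as a single predicate. This is only an illustration of the prose, not the scheduler's actual query: the ``Job`` and ``TaskInstance`` dataclasses, the ``is_potential_zombie`` function, and all of its parameters are hypothetical names invented here, and only ``limit_dttm`` and ``LocalTaskJob`` come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


# Hypothetical stand-ins for illustration only; these are NOT Airflow's models.
@dataclass
class Job:
    state: str
    job_type: str
    latest_heartbeat: datetime


@dataclass
class TaskInstance:
    state: str
    queued_by_job_id: int


def is_potential_zombie(
    ti: TaskInstance,
    job: Job,
    scheduler_job_id: int,
    now: datetime,
    threshold_secs: int,
) -> bool:
    """Apply the four criteria from the prose to one task instance and its job."""
    # Heartbeats older than this cutoff count as missed.
    limit_dttm = now - timedelta(seconds=threshold_secs)
    return (
        ti.state == "running"                          # 1. task instance state
        and (
            job.state != "running"                     # 2. job state, or
            or job.latest_heartbeat < limit_dttm       #    stale heartbeat
        )
        and job.job_type == "LocalTaskJob"             # 3. job type
        and ti.queued_by_job_id == scheduler_job_id    # 4. queued by this job
    )


now = datetime(2025, 1, 26, 12, 0, 0)
ti = TaskInstance(state="running", queued_by_job_id=1)
stale_job = Job("running", "LocalTaskJob", now - timedelta(minutes=10))
fresh_job = Job("running", "LocalTaskJob", now - timedelta(seconds=5))
stale_result = is_potential_zombie(ti, stale_job, 1, now, 300)
fresh_result = is_potential_zombie(ti, fresh_job, 1, now, 300)
```

With a 300-second threshold, the job whose last heartbeat is ten minutes old is flagged, while the job that sent a heartbeat five seconds ago is not.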

Reproducing zombie tasks locally
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
28 changes: 28 additions & 0 deletions docs/apache-airflow/static/redirects.js
@@ -0,0 +1,28 @@
+/*!
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+document.addEventListener("DOMContentLoaded", function () {
+  const redirects = {
+    "zombie-undead-tasks": "zombie-tasks",
+  };
+  const fragment = window.location.hash.substring(1);
+  if (redirects[fragment]) {
+    window.location.hash = redirects[fragment];
+  }
+});
1 change: 1 addition & 0 deletions docs/conf.py
@@ -363,6 +363,7 @@ def _get_rst_filepath_from_path(filepath: pathlib.Path):
 "administration-and-deployment/logging-monitoring/advanced-logging-configuration.html",
 "howto/docker-compose/index.html",
 ]
+html_js_files.append("redirects.js")
 if PACKAGE_NAME.startswith("apache-airflow-providers"):
 manual_substitutions_in_generated_html = ["example-dags.html", "operators.html", "index.html"]
 if PACKAGE_NAME == "docker-stack":