Support timeout in BatchOperator #45619

nrobinson-intelycare
Contributor

When using a deferrable BatchOperator, it is possible to specify max_retries and poll_interval, but exhausting them does not appear to terminate the Batch job once the Airflow task times out.

Being able to specify a timeout and have Batch terminate the job would let us enforce strict timeouts on critical pipelines running on Fargate via AWS Batch.

This PR adds a timeout parameter to the BatchOperator constructor, mirroring the timeout argument of boto3's submit_job:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/batch/client/submit_job.html

timeout (dict) –

The timeout configuration for this SubmitJob operation. You can specify a timeout duration after which Batch terminates your jobs if they haven’t finished. If a job is terminated due to a timeout, it isn’t retried. The minimum value for the timeout is 60 seconds. This configuration overrides any timeout configuration specified in the job definition. For array jobs, child jobs have the same timeout configuration as the parent job. For more information, see Job Timeouts in the Amazon Elastic Container Service Developer Guide.

  • attemptDurationSeconds (integer) –

    The job timeout time (in seconds) that’s measured from the job attempt’s startedAt timestamp. After this time passes, Batch terminates your jobs if they aren’t finished. The minimum value for the timeout is 60 seconds.

    For array jobs, the timeout applies to the child jobs, not to the parent array job.

    For multi-node parallel (MNP) jobs, the timeout applies to the whole job, not to the individual nodes.
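
As a rough illustration of the timeout structure quoted above, the following sketch assembles keyword arguments for boto3's submit_job and rejects durations below the documented 60-second minimum. The helper name build_submit_job_kwargs and the job, queue, and definition names are made up for this example; only the jobName, jobQueue, jobDefinition, and timeout keys come from the boto3 interface.

```python
from __future__ import annotations

from typing import Any

# Minimum stated in the boto3 documentation quoted above.
MIN_ATTEMPT_DURATION_SECONDS = 60


def build_submit_job_kwargs(
    job_name: str,
    job_queue: str,
    job_definition: str,
    attempt_duration_seconds: int | None = None,
) -> dict[str, Any]:
    """Assemble keyword arguments for a Batch submit_job call (hypothetical helper)."""
    kwargs: dict[str, Any] = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
    }
    if attempt_duration_seconds is not None:
        if attempt_duration_seconds < MIN_ATTEMPT_DURATION_SECONDS:
            raise ValueError(
                f"attemptDurationSeconds must be at least {MIN_ATTEMPT_DURATION_SECONDS}"
            )
        # Batch terminates the job attempt once this many seconds have passed
        # since the attempt's startedAt timestamp.
        kwargs["timeout"] = {"attemptDurationSeconds": attempt_duration_seconds}
    return kwargs


# Placeholder names; substitute real queue and job definition names.
submit_kwargs = build_submit_job_kwargs("nightly-load", "fargate-queue", "etl-job", 3600)

# With boto3 installed and AWS credentials configured, the call would be:
#   boto3.client("batch").submit_job(**submit_kwargs)
```

Omitting the timeout key entirely when no duration is given leaves any timeout from the job definition in effect, which matches the override behaviour described in the quoted documentation.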



boring-cyborg bot commented Jan 13, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: [email protected]
Slack: https://s.apache.org/airflow-slack

@@ -208,6 +210,7 @@ def __init__(
        self.poll_interval = poll_interval
        self.awslogs_enabled = awslogs_enabled
        self.awslogs_fetch_interval = awslogs_fetch_interval
        self.timeout = timeout
Contributor

This can easily be confused with the Airflow task timeout.

Contributor Author

Would batch_job_timeout be better?

Contributor

eladkal commented Jan 13, 2025

I think keeping the same name as the boto3 interface, or a similar one, is simpler.

Contributor

Please let's not use timeout.
You can use the same name as boto3 or choose something else, but not a name that can cause confusion with BaseOperator parameters.

Contributor

boto3_timeout would make sense I think?

Contributor Author

I'm open to boto3_timeout, batch_job_timeout, job_timeout, etc.; whatever is acceptable.
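
To make the naming concern in this thread concrete, here is a small sketch using stand-in classes rather than real Airflow code. BaseTaskSketch and BatchTaskSketch are hypothetical, and batch_job_timeout is just one of the names floated above, not a confirmed parameter. The point is only that a distinctly named argument keeps the Batch-side limit separate from the task-level execution_timeout that BaseOperator already accepts.

```python
from __future__ import annotations

from datetime import timedelta
from typing import Any


class BaseTaskSketch:
    """Stand-in for Airflow's BaseOperator; only execution_timeout is modelled."""

    def __init__(self, task_id: str, execution_timeout: timedelta | None = None) -> None:
        self.task_id = task_id
        # Task-level limit enforced on the Airflow side.
        self.execution_timeout = execution_timeout


class BatchTaskSketch(BaseTaskSketch):
    """Stand-in for a Batch operator with a separately named Batch timeout."""

    def __init__(
        self,
        task_id: str,
        batch_job_timeout: dict[str, int] | None = None,  # hypothetical name
        **kwargs: Any,
    ) -> None:
        super().__init__(task_id, **kwargs)
        # Job-level limit enforced by AWS Batch, in the shape submit_job expects.
        self.batch_job_timeout = batch_job_timeout

    def submit_job_kwargs(self) -> dict[str, Any]:
        """Return the extra submit_job arguments implied by the Batch timeout."""
        if self.batch_job_timeout is None:
            return {}
        return {"timeout": self.batch_job_timeout}


task = BatchTaskSketch(
    "run_etl",
    batch_job_timeout={"attemptDurationSeconds": 3600},
    execution_timeout=timedelta(hours=2),
)
```

With separate names, both limits can be set on the same task without ambiguity about which system enforces which one.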

@@ -70,6 +70,7 @@ def setup_method(self, _, get_client_type_mock):
            aws_conn_id="airflow_test",
            region_name="eu-west-1",
            tags={},
            timeout={"attemptDurationSeconds": 3600},
Contributor Author

@vincbeck Should I be adding timeout={"attemptDurationSeconds": 3600}, to assert_called_once_with calls?

https://github.com/apache/airflow/actions/runs/12751882848/job/35600073343?pr=45619#step:11:3020

Contributor

Yes please
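
For readers following along, a minimal sketch of what such an assertion might look like with unittest.mock is shown below. The submit_batch_job function and all job names are hypothetical stand-ins, not the provider's actual code or test fixtures; the mocked client simply records the keyword arguments it receives.

```python
from unittest import mock


def submit_batch_job(client, job_name, job_queue, job_definition, timeout=None):
    """Hypothetical stand-in for the operator's submit step."""
    args = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
    }
    if timeout is not None:
        args["timeout"] = timeout
    return client.submit_job(**args)["jobId"]


# A mocked Batch client, so no AWS access is needed.
client = mock.MagicMock()
client.submit_job.return_value = {"jobId": "job-123"}

job_id = submit_batch_job(
    client,
    "test-job",
    "test-queue",
    "hello-world",
    timeout={"attemptDurationSeconds": 3600},
)

# Fails with an AssertionError if the timeout was not forwarded unchanged.
client.submit_job.assert_called_once_with(
    jobName="test-job",
    jobQueue="test-queue",
    jobDefinition="hello-world",
    timeout={"attemptDurationSeconds": 3600},
)
```

Because assert_called_once_with compares the full set of arguments, any existing assertion on submit_job would need the new timeout keyword added once the operator starts passing it.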

@nrobinson-intelycare
Contributor Author

Something is up, I'll try to get the tests working locally

@nrobinson-intelycare
Contributor Author

Moved to #45660 to avoid merge commits

Labels
area:providers provider:amazon AWS/Amazon - related issues