Add startup_check_interval_seconds to PodManager's await_pod_start #31008

stelsemeyer-m60 · 2023-05-02T06:57:51Z

Parametrize the interval at which the Kubernetes pod status is polled when launching a new pod.

When using serverless Kubernetes services like Google GKE Autopilot, pod startup can take longer than usual because of cold starts. With the default check every second, the logs then fill up with warnings (see below), so a lower check frequency may be desirable:

[2023-05-02, 05:33:22 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:23 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:24 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:25 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:26 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:27 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:28 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:29 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:30 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:31 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:32 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:33 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:34 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:35 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:36 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:37 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
...
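A minimal sketch of the idea, not the provider's actual code: the function below is a simplified stand-in for the polling loop, with the check interval exposed as a parameter. The names `pod_is_running`, `sleep`, and `simulate` are assumptions made for illustration, and elapsed time is approximated as the sum of the sleep intervals.

```python
import time
from typing import Callable, List


def await_pod_start(
    pod_is_running: Callable[[], bool],
    startup_timeout: float = 120.0,
    startup_check_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until the pod runs; return the warnings that were emitted."""
    warnings: List[str] = []
    waited = 0.0
    while not pod_is_running():
        if waited >= startup_timeout:
            raise TimeoutError(f"Pod took longer than {startup_timeout} seconds to start")
        warnings.append("Pod not yet started")
        # One warning per check, so a longer interval means fewer log lines.
        sleep(startup_check_interval)
        waited += startup_check_interval
    return warnings


def simulate(cold_start_seconds: float, interval: float) -> int:
    """Count warnings for a pod that becomes ready after cold_start_seconds."""
    clock = {"now": 0.0}

    def fake_sleep(seconds: float) -> None:
        clock["now"] += seconds  # advance simulated time instead of blocking

    def pod_is_running() -> bool:
        return clock["now"] >= cold_start_seconds

    return len(
        await_pod_start(pod_is_running, startup_check_interval=interval, sleep=fake_sleep)
    )
```

With a simulated 60-second cold start, `simulate(60, 1)` yields 60 warnings while `simulate(60, 10)` yields 6, which is the reduction in log noise this change is after.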

boring-cyborg · 2023-05-02T06:57:55Z

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst).
Here are some useful points:

Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
Be sure to read the Airflow Coding style.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: [email protected]
Slack: https://s.apache.org/airflow-slack

potiuk · 2023-05-07T06:36:39Z

Would it be possible to add/update a unit test for that one?

stelsemeyer-m60 · 2023-05-08T23:39:48Z

> Would it be possible to add/update a unit test for that one?

I had trouble getting the unit tests running locally (even though the documentation seems extensive).

potiuk · 2023-05-09T09:20:40Z

> > Would it be possible to add/update a unit test for that one?

> I had trouble getting the unit tests running locally (even though the documentation seems extensive).

With breeze they should work out-of-the-box (same with static checks). Unless you fix them, we cannot merge this, because it would break other people's workflows (we have 60-70 commits from about 50 people a week), so you need to fix those to get it merged.

stelsemeyer-m60 · 2023-05-09T12:21:57Z

> > > Would it be possible to add/update a unit test for that one?

> > I had trouble getting the unit tests running locally (even though the documentation seems extensive).

> With breeze they should work out-of-the-box (same with static checks). Unless you fix them, we cannot merge this, because it would break other people's workflows (we have 60-70 commits from about 50 people a week), so you need to fix those to get it merged.

Yeah, absolutely understandable. Will try to get them running and report back!

stelsemeyer-m60 · 2023-05-09T22:59:43Z

@potiuk: Please check again. I fixed all failing tests and added a fairly naive one for the newly added parameter.
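For readers who want to see the shape of such a test, here is a minimal sketch that checks `time.sleep` is called with the configured value. `wait_until_started` is a hypothetical stand-in for the polling loop, not the provider's code, and the test name is likewise an assumption.

```python
import time
from unittest import mock


def wait_until_started(check_started, startup_check_interval=1):
    """Hypothetical stand-in for the polling loop; not the provider's code."""
    while not check_started():
        time.sleep(startup_check_interval)


def test_sleep_uses_configured_interval():
    # Report "not started" twice, then "started".
    check_started = mock.Mock(side_effect=[False, False, True])
    with mock.patch("time.sleep") as mock_sleep:
        wait_until_started(check_started, startup_check_interval=7)
    # Two failed checks lead to two sleeps, each with the configured interval.
    assert mock_sleep.call_args_list == [mock.call(7), mock.call(7)]
    assert check_started.call_count == 3
```

Patching `time.sleep` keeps the test fast and lets it assert on the exact interval passed, without waiting in real time.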

eladkal · 2023-05-21T08:09:14Z

@stelsemeyer-m60 can you add an entry in docs/apache-airflow-providers-google/operators/cloud/kubernetes_engine.rst
suggesting that users consider changing the default value of this parameter to handle warnings due to cold starts?
I think it's best to document it so users know about this option. (I thought of suggesting overriding the value in GKEStartPodOperator, but that creates a coupling with a newer cncf.kubernetes provider version, which might not be desired.)

stelsemeyer-m60 · 2023-05-21T09:59:26Z

> @stelsemeyer-m60 can you add an entry in docs/apache-airflow-providers-google/operators/cloud/kubernetes_engine.rst suggesting that users consider changing the default value of this parameter to handle warnings due to cold starts? I think it's best to document it so users know about this option. (I thought of suggesting overriding the value in GKEStartPodOperator, but that creates a coupling with a newer cncf.kubernetes provider version, which might not be desired.)

Good idea. Done ✅

eladkal

LGTM

Will merge when CI is green

stelsemeyer-m60 · 2023-05-27T07:06:56Z

> LGTM

> Will merge when CI is green

Thanks. Sorry, I have not had the chance to trace the reason for the remaining errors yet.

potiuk · 2023-05-27T07:27:46Z

It looks like some intermittent problems were involved. I rebased the PR to re-run the tests.

jedcunningham · 2023-05-30T22:15:22Z

airflow/providers/cncf/kubernetes/operators/pod.py

@@ -174,6 +174,7 @@ class KubernetesPodOperator(BaseOperator):
        during the next try. If False, always create a new pod for each try.
    :param labels: labels to apply to the Pod. (templated)
    :param startup_timeout_seconds: timeout in seconds to startup the pod.
+    :param startup_check_interval_seconds: interval in seconds to check if the pod has already started


I think we should use startup_timeout_check_interval_seconds instead.

jedcunningham · 2023-05-30T22:19:38Z

airflow/providers/cncf/kubernetes/operators/pod.py

@@ -595,6 +602,7 @@ def invoke_defer_method(self):
                should_delete_pod=self.is_delete_operator_pod,
                get_logs=self.get_logs,
                startup_timeout=self.startup_timeout_seconds,
+                startup_check_interval=self.startup_check_interval_seconds,


I'm not convinced this does anything. The trigger uses poll_interval instead. Feels a little duplicative. Maybe we deprecate poll_interval and just use this new one?

stelsemeyer-m60 · 2023-07-16

Agree, it is not doing anything here. Will fix this.
Moreover, I have no strong opinion on whether to deprecate poll_interval or to rename the new parameter (startup_check_interval_seconds / startup_timeout_check_interval_seconds) to poll_interval in the new implementation.

stelsemeyer-m60 · 2023-07-16

OK, going through the code base, I came across several unparametrized time.sleep calls (with either 1 or 2 seconds):

1. airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.await_pod_start: the startup polling, set to 1s
2. airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs: polling that follows the logs, set to 1s
3. airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.await_container_completion: polling of container state if log following is deactivated, set to 1s
4. airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.await_pod_completion: same as above for container completion, set to 2s
5. airflow.providers.cncf.kubernetes.triggers.pod.KubernetesPodTrigger.run: set to 2s

I think we want to parametrize 1, but not 2 to 4. I am not sure about 5 though, because it also checks the status once the pod is running, no? Please correct me if I am wrong.
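One option raised in this thread, deprecating poll_interval in favor of the new parameter, could be sketched roughly as below. This is purely illustrative and is not what the provider does: `resolve_check_interval` is a hypothetical helper, and the 1-second default is taken from the startup polling value listed above.

```python
import warnings
from typing import Optional


def resolve_check_interval(
    startup_check_interval_seconds: Optional[float] = None,
    poll_interval: Optional[float] = None,
    default: float = 1.0,
) -> float:
    """Pick the polling interval, preferring the new parameter (hypothetical helper)."""
    if poll_interval is not None:
        # The old name still works but tells callers to migrate.
        warnings.warn(
            "poll_interval is deprecated; use startup_check_interval_seconds instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if startup_check_interval_seconds is None:
            return poll_interval
    if startup_check_interval_seconds is not None:
        return startup_check_interval_seconds
    return default
```

The caveat about item 5 still applies: if the trigger's interval also governs polling after the pod is running, a single startup-only parameter would not be a drop-in replacement for it.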

github-actions · 2023-07-15T00:13:48Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

github-actions · 2023-08-31T06:27:10Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

stelsemeyer-m60 · 2023-09-09T06:00:39Z

(Test if comment reopens.)

add startup_check_interval_seconds

2d65f3b

boring-cyborg bot added provider:cncf-kubernetes Kubernetes provider related issues area:providers labels May 2, 2023

change default value in method

f66e47f

stelsemeyer-m60 marked this pull request as ready for review May 2, 2023 07:05

stelsemeyer-m60 requested a review from jedcunningham as a code owner May 2, 2023 07:05

potiuk approved these changes May 7, 2023

stelsemeyer-m60 added 4 commits May 9, 2023 22:45

fix static checks, add missing param, fix typo

571cb61

default is 1s

149e1de

fix outdated docs

48a6a37

add test to check time.sleep is called with specific value

e8bfa0a

Merge remote-tracking branch 'upstream/main' into startup_check_interval

acd1377

eladkal approved these changes May 21, 2023

Merge branch 'main' into startup_check_interval

4e27828

add more documentation

23bf544

rephrase

260916d

eladkal approved these changes May 21, 2023

stelsemeyer-m60 added 4 commits May 27, 2023 08:27

add startup_check_interval_seconds

023db3c

change default value in method

810d92e

fix static checks, add missing param, fix typo

602e26a

default is 1s

3da5b57

stelsemeyer-m60 added 4 commits May 27, 2023 08:27

fix outdated docs

a4ae513

add test to check time.sleep is called with specific value

e8facd3

add more documentation

11c3157

rephrase

7bc5bc8

potiuk force-pushed the startup_check_interval branch from a7520e3 to 7bc5bc8 Compare May 27, 2023 07:27

jedcunningham reviewed May 30, 2023

github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jul 15, 2023

Merge branch 'startup_check_interval' of https://github.com/stelsemeyer-m60/airflow into startup_check_interval

a75e430

stelsemeyer-m60 mentioned this pull request Jul 16, 2023

Parametrize poll_interval in KubernetesPodOperator #32631

Closed

github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Jul 17, 2023

github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 31, 2023

github-actions bot closed this Sep 6, 2023

stelsemeyer-m60 mentioned this pull request Sep 9, 2023

Add startup_check_interval_seconds to PodManager's await_pod_start #34231

Merged
