Add startup_check_interval_seconds to PodManager's await_pod_start #31008

Closed


stelsemeyer-m60
Contributor


Parametrize the interval at which the Kubernetes pod status is polled when launching a new pod.

When using serverless Kubernetes services like Google GKE Autopilot, pod startup is sometimes expected to take longer due to a cold start. With the default check every second, the logs can get spammed with warnings (see below), so a lower check frequency might be desired.

[2023-05-02, 05:33:22 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:23 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:24 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:25 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:26 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:27 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:28 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:29 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:30 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:31 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:32 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:33 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:34 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:35 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:36 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
[2023-05-02, 05:33:37 UTC] {pod_manager.py:187} WARNING - Pod not yet started: some-pod-he2j8139
...
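
To illustrate the proposed change, here is a minimal sketch of a startup polling loop with a configurable interval. It is not the provider's actual `await_pod_start` code: `wait_for_start`, `is_started`, `StartupTimeoutError`, and the injectable `sleep`/`clock` arguments are hypothetical names chosen for this example.

```python
import time
from typing import Callable


class StartupTimeoutError(Exception):
    """Raised when startup takes longer than the timeout (hypothetical name)."""


def wait_for_start(
    is_started: Callable[[], bool],
    startup_timeout: float = 120,
    startup_check_interval: float = 1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll is_started() until it returns True.

    Returns the number of failed checks, i.e. how many
    "not yet started" warnings such a loop would have logged.
    """
    deadline = clock() + startup_timeout
    failed_checks = 0
    while not is_started():
        failed_checks += 1  # one warning line per failed check
        if clock() >= deadline:
            raise StartupTimeoutError(f"not started after {startup_timeout}s")
        # The interval is a parameter instead of a hard-coded 1 second.
        sleep(startup_check_interval)
    return failed_checks
```

Because `sleep` and `clock` are injectable, the loop can be exercised with a fake clock. In such a simulation of a 60-second cold start, a 1-second interval yields 60 failed checks, while a 10-second interval yields 6.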

@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes provider related issues area:providers labels May 2, 2023
@boring-cyborg

boring-cyborg bot commented May 2, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

@stelsemeyer-m60 stelsemeyer-m60 marked this pull request as ready for review May 2, 2023 07:05
@potiuk
Member

potiuk commented May 7, 2023

Would it be possible to add/update a unit test for that one?

@stelsemeyer-m60
Contributor Author

> Would it be possible to add/update a unit test for that one?

I had trouble getting the unit tests to run locally (even though the documentation seems to be extensive).

@potiuk
Member

potiuk commented May 9, 2023

> > Would it be possible to add/update a unit test for that one?
>
> I had trouble getting the unit tests to run locally (even though the documentation seems to be extensive).

With Breeze they should work out of the box (same with static checks). Unless you fix them, we cannot do much to merge it, because it would break other people's workflows (we have 60-70 commits from about 50 people a week), so you need to fix those to get the PR merged.

@stelsemeyer-m60
Contributor Author

> > > Would it be possible to add/update a unit test for that one?
> >
> > I had trouble getting the unit tests to run locally (even though the documentation seems to be extensive).
>
> With Breeze they should work out of the box (same with static checks). Unless you fix them, we cannot do much to merge it, because it would break other people's workflows (we have 60-70 commits from about 50 people a week), so you need to fix those to get the PR merged.

Yeah, absolutely understandable. Will try to get them running and report back!

@stelsemeyer-m60
Contributor Author

@potiuk: Please check again. I fixed all failing tests and added a fairly naive one for the newly added parameter.
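
A test for such a parameter could look roughly like the sketch below. It does not reproduce the test added in this PR: `wait_until_started` is a hypothetical stand-in for the polling code, and the test only checks that the configured interval is the value passed to `time.sleep`.

```python
import time
import unittest
from unittest import mock


def wait_until_started(is_started, startup_check_interval=1):
    """Hypothetical stand-in for the polling code under test."""
    while not is_started():
        time.sleep(startup_check_interval)


class TestStartupCheckInterval(unittest.TestCase):
    @mock.patch("time.sleep")
    def test_sleep_uses_configured_interval(self, mock_sleep):
        # Pretend the pod is pending for two checks, then running.
        is_started = mock.Mock(side_effect=[False, False, True])

        wait_until_started(is_started, startup_check_interval=5)

        # Two failed checks mean two sleeps, each with the configured interval.
        self.assertEqual(mock_sleep.call_args_list, [mock.call(5), mock.call(5)])
```

Patching `time.sleep` keeps the test fast and makes the interval observable without waiting in real time.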

@eladkal
Contributor

eladkal commented May 21, 2023

@stelsemeyer-m60 can you add an entry in docs/apache-airflow-providers-google/operators/cloud/kubernetes_engine.rst
suggesting that users consider changing the default value of this parameter to handle warnings due to cold starts?
I think it's best to document it so users will know about this option. (I thought of suggesting overriding the value in GKEStartPodOperator, but that would create a coupling with a newer cncf.kubernetes provider version, which might not be desired.)

@stelsemeyer-m60
Contributor Author

> @stelsemeyer-m60 can you add an entry in docs/apache-airflow-providers-google/operators/cloud/kubernetes_engine.rst suggesting that users consider changing the default value of this parameter to handle warnings due to cold starts? I think it's best to document it so users will know about this option. (I thought of suggesting overriding the value in GKEStartPodOperator, but that would create a coupling with a newer cncf.kubernetes provider version, which might not be desired.)

Good idea. Done ✅

Contributor

@eladkal eladkal left a comment

LGTM

Will merge when CI is green

@stelsemeyer-m60
Contributor Author

> LGTM
>
> Will merge when CI is green

Thanks. Sorry, did not have the chance to trace the reason for the remaining errors yet.

@potiuk potiuk force-pushed the startup_check_interval branch from a7520e3 to 7bc5bc8 Compare May 27, 2023 07:27
@potiuk
Member

potiuk commented May 27, 2023

It looks like some intermittent problems were involved. I rebased the PR to re-run the tests.

@@ -174,6 +174,7 @@ class KubernetesPodOperator(BaseOperator):
         during the next try. If False, always create a new pod for each try.
     :param labels: labels to apply to the Pod. (templated)
     :param startup_timeout_seconds: timeout in seconds to startup the pod.
+    :param startup_check_interval_seconds: interval in seconds to check if the pod has already started
Member

I think we should use startup_timeout_check_interval_seconds instead.

@@ -595,6 +602,7 @@ def invoke_defer_method(self):
     should_delete_pod=self.is_delete_operator_pod,
     get_logs=self.get_logs,
     startup_timeout=self.startup_timeout_seconds,
+    startup_check_interval=self.startup_check_interval_seconds,
Member

I'm not convinced this does anything. The trigger uses poll_interval instead. Feels a little duplicative. Maybe we deprecate poll_interval and just use this new one?

Contributor Author

Agreed, it is not doing anything here. I will fix this.
Moreover, I have no strong opinion on whether to deprecate poll_interval or to rename the new parameter (startup_check_interval_seconds / startup_timeout_check_interval_seconds) to poll_interval in the new implementation.
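
If `poll_interval` were deprecated in favor of the new parameter, one common pattern is to keep accepting the old name while emitting a `DeprecationWarning`. The sketch below uses a hypothetical helper, `resolve_check_interval`, with assumed parameter names and defaults; nothing in this thread confirms that the provider handles it this way.

```python
import warnings
from typing import Optional


def resolve_check_interval(
    startup_check_interval: Optional[float] = None,
    poll_interval: Optional[float] = None,
    default: float = 1,
) -> float:
    """Return the interval to use, preferring the new name over the deprecated one."""
    if poll_interval is not None:
        # The old name still works, but callers are told to migrate.
        warnings.warn(
            "poll_interval is deprecated; use startup_check_interval instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if startup_check_interval is None:
            return poll_interval
    if startup_check_interval is not None:
        return startup_check_interval
    return default
```

When both names are given, the new one wins and the warning is still emitted, so existing callers keep working while being nudged toward the new parameter.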

Contributor Author

Ok, going through the code base, I came across different occurrences of unparametrized time.sleep calls (with either 1 or 2 seconds):

  1. airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.await_pod_start: the startup polling, set to 1s
  2. airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs: polling that follows the logs, set to 1s
  3. airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.await_container_completion: polling of container state if log following is deactivated, set to 1s
  4. airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.await_pod_completion: same as above for container completion, set to 2s
  5. airflow.providers.cncf.kubernetes.triggers.pod.KubernetesPodTrigger.run: set to 2s

I think we want to change 1, but not 2 through 4. I am not sure about 5, though, because it also checks the status once the pod is running, no? Please correct me if I am wrong.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Jul 15, 2023
@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Jul 17, 2023
@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 31, 2023
@github-actions github-actions bot closed this Sep 6, 2023
@stelsemeyer-m60
Contributor Author

(Test if comment reopens.)
