Invalid kube-config file. Expected key current-context in kube-config when using deferrable=True #34644
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue please do so, no need to wait for approval.
Deferrable and non-deferrable operators have used the exact same method to load the kube config file since 7.0.0, so I'm surprised that you get this exception only in deferrable mode. Since your config file doesn't have the current-context (default context) key, I wonder if you added
I have tried quite a few different configurations at this point, but there just seems to be an issue here. When running the DAG below, the only task that completes is the deferrable-false task. The other two appear to run the code and output hello world with deferrable set, and I see the DAG status change to purple (deferred); however, the runs still fail with the error below. I have checked the kube-config file and I can see there is a contexts key. I have re-opened my Google support case, asking their product team to test the DAG themselves on composer-2.4.3-airflow-2.5.3. If there are any other suggestions, please let me know.

ERROR [2023-10-13, 14:38:27 UTC] {standard_task_runner.py:100} ERROR - Failed to execute job 48004 for task deferrable-true-extended-conf (Invalid kube-config file. Expected key contexts in kube-config; 743598)

TESTING DAG:

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
import airflow
from airflow import DAG
from datetime import timedelta
default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'tommy_test_kub_simple_dag',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='*/10 * * * *',
    max_active_runs=2,
    catchup=False,
    dagrun_timeout=timedelta(minutes=10),
) as dag:
    task1 = KubernetesPodOperator(
        name="deferrable-true",
        image="python:3.11-slim",
        cmds=['python', '-c', "print('hello world')"],
        task_id="deferrable-true",
        config_file="/home/airflow/composer_kube_config",
        deferrable=True,
        in_cluster=False
    )
    task2 = KubernetesPodOperator(
        name="deferrable-false",
        image="python:3.11-slim",
        cmds=['python', '-c', "print('hello world')"],
        task_id="deferrable-false",
        config_file="/home/airflow/composer_kube_config",
        deferrable=False,
        in_cluster=False
    )
    task3 = KubernetesPodOperator(
        name="deferrable-true-extended-conf",
        image="python:3.11-slim",
        cmds=['python', '-c', "print('hello world')"],
        task_id="deferrable-true-extended-conf",
        kubernetes_conn_id="kubernetes_default",
        deferrable=True,
        in_cluster=False,
        cluster_context="gke_my_orchestrater_id",
        config_file="/home/airflow/composer_kube_config",
    )

    task1
    task2
    task3
@tommyhutcheson what do you think about avoiding those files and providing the Kube Config in JSON? I think it should be possible. Having blocking operations (e.g. file handling) in deferrable mode is in most cases a bad design, and I think the community should aim to avoid that everywhere. Let me know if it was possible for you to provide this configuration via JSON.
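A minimal sketch of what that could look like, assuming the cncf.kubernetes provider's Kubernetes connection accepts the config content via a "kube_config" extra (the connection id, context, cluster and user values below are placeholders; check the provider docs for your version):

import json
from airflow.models.connection import Connection

# Hypothetical kube config content; replace contexts/clusters/users with the
# real entries from your environment.
kube_config_content = {
    "apiVersion": "v1",
    "kind": "Config",
    "current-context": "my-context",
    "contexts": [{"name": "my-context", "context": {"cluster": "my-cluster", "user": "my-user"}}],
    "clusters": [{"name": "my-cluster", "cluster": {"server": "https://127.0.0.1:6443"}}],
    "users": [{"name": "my-user", "user": {}}],
}

# The "kube_config" extra carries the config content itself, so no file needs
# to exist on the worker or the triggerer.
inline_config_conn = Connection(
    conn_id="kubernetes_inline_config",
    conn_type="kubernetes",
    extra=json.dumps({
        "in_cluster": False,
        "namespace": "default",
        "kube_config": json.dumps(kube_config_content),
    }),
)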
I created a BashOperator which prints the kube config in the Composer env. The log shows it actually has
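For reference, a minimal sketch of such a debugging task, assuming the Composer default config path used elsewhere in this issue:

from airflow.operators.bash import BashOperator

# Add this inside the DAG shown earlier; it dumps the kube config from the
# worker into the task log so you can see whether current-context / contexts
# are present there.
print_kube_config = BashOperator(
    task_id="print_kube_config",
    bash_command="cat /home/airflow/composer_kube_config",
)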
I reproduced the problem with the following DAG. Very interestingly, the pod/container actually succeeded, as we can see. DAG:
Logs:
Does anybody know the possible reason? Or is there a way to enable debug logs in
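One generic knob, independent of this issue: Airflow's [logging] logging_level option raises verbosity for all components, including the triggerer. A sketch of its environment-variable form, which would have to be set on the triggerer's environment rather than in a DAG:

# Equivalent to logging_level = DEBUG in the [logging] section of airflow.cfg;
# add it to the environment of the triggerer (and worker) processes.
AIRFLOW_DEBUG_LOGGING_ENV = {"AIRFLOW__LOGGING__LOGGING_LEVEL": "DEBUG"}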
@hussein-awala, any idea about the observations above?
I feel like the error messages and logs need to be improved. The current ones are not sufficient to figure out what went wrong in the trigger/hook.
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
Friendly ping!
@potiuk @hussein-awala can you please chime in?
Issue ongoing, keeping the ticket open with an author comment.
The lack of more detail seems to be because the message is not coming from Airflow but from the pod. The message is "just" displayed by Airflow's KPO, and the error is somewhere on the pod. There are likely two ways you can address your problem @tommyhutcheson:
However, I'd urge you to upgrade everything you can first. Many of our users experience problems that have long been solved, and in this case quite a few fixes have been implemented since your version. You can either approach it the way you want, to be sure that you should upgrade (in which case I advise you to do a detailed analysis of the changelog), or just upgrade and see if you still experience the problem. The latter is usually faster and takes less time, both for you and for the volunteers here who have no time to go through a detailed changelog just to make sure particular problems have been fixed. This is an open-source project, so people here help when they have time (they are not paid for it), and in cases like this it's quite a bit on the user to make the effort to upgrade to the latest version when they experience problems, especially when there have been many fixes since.
Please let us know how things go after you investigate and (hopefully) upgrade, so that we can (hopefully) close the ticket. In the meantime I'll mark it as pending response.
Sorry for the delay. I will try to reproduce it and implement a fix before the next providers' release wave.
Could you provide the Kubernetes conn you are using in your operator? (you can hide the confidential information)
I tested, and I was able to reproduce the exception in only one case: are you sure the Kubernetes configuration file exists in the Triggerer pod, and at the same path as on the worker? In all the reports you provided, you confirmed that the config file is present in the worker (for example, when you tested with
@hussein-awala thanks for your input! I will double-check whether the config exists in the triggerer pod. But if that was the cause, do you know why, in my previous repro, the operator container/pod succeeded with log
It could work with a version < 7.0.0, but since #31322 the behavior changed: instead of converting the file to a dict and providing it to the trigger, we now provide the config file path and load the file in the trigger. This PR was a bug fix, and there was also another reason for it (will explain more later).
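A rough illustration of the consequence described above (not the provider's actual code): in deferrable mode the kube config is loaded from the given path inside the triggerer process, so the file has to be readable there as well.

from kubernetes import config

# Conceptually what happens in the triggerer: if the path does not exist in
# the triggerer pod, loading fails with a kube-config error like the one
# reported in this issue.
config.load_kube_config(config_file="/home/airflow/composer_kube_config")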
That would certainly explain the behaviour... Nice one @hussein-awala :)
It doesn't seem to be related, because the original problem occurred with 7.3.0, and I also reproduced the error with 7.9.0 in composer-2.5.2-airflow-2.6.3. I'm still trying to figure out how to verify that the file exists in the Trigger Pod. I'd appreciate it if someone could provide sample code for that!
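Not from the thread, but one hedged way to check this from within Airflow itself is a throwaway custom trigger, since trigger code is the only DAG-adjacent code that actually executes inside the triggerer process (the class and module path below are hypothetical):

import asyncio
import os

from airflow.triggers.base import BaseTrigger, TriggerEvent


class FileExistsTrigger(BaseTrigger):
    """Reports whether a path is visible from the triggerer process."""

    def __init__(self, path: str):
        super().__init__()
        self.path = path

    def serialize(self):
        # Must point at wherever this class is importable from in your env.
        return ("my_plugins.file_exists_trigger.FileExistsTrigger", {"path": self.path})

    async def run(self):
        await asyncio.sleep(0)  # yield control once, then report and finish
        yield TriggerEvent({"path": self.path, "exists": os.path.exists(self.path)})

Deferring to it from a small custom operator (via self.defer(trigger=..., method_name=...)) and reading the event payload in the task log would tell you whether /home/airflow/composer_kube_config is visible from the triggerer.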
I said that deferrable mode works fine in a version <7.0.0 without adding the config file to the triggerer; reproducing the problem with 7.3.0 and 7.9.0 does not contradict what I said.
Quickly checking, I would say that providing extra files to the triggerer is not possible. I recommend contacting the GCP support team to check with them whether this is possible, and how they can support it if it isn't.
Before #29498 the flow was as follows:
This way, two things were achieved:
@hussein-awala Why has this process been reverted in #29498?
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
For the second reason, here it is: https://www.cve.org/CVERecord?id=CVE-2023-51702. We're working on an improvement for the trigger data stored in the database; once it's released, we will check how we can fix this issue.
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
I am experiencing the same issue today, and it appears there is no clear resolution in the comments above. Was this issue resolved? If so, how?
Same problem with airflow==2.10.2, apache-airflow-providers-cncf-kubernetes==9.0.0, kind 0.24.0
airflow connection "kubernetes_default": {
"conn_type": "kubernetes",
"extra": "{\"extra__kubernetes__in_cluster\": false, \"extra__kubernetes__kube_config_path\": \"/opt/airflow/include/.kube/config\", \"extra__kubernetes__namespace\": \"default\", \"extra__kubernetes__cluster_context\": \"kind-kind\", \"extra__kubernetes__disable_verify_ssl\": false, \"extra__kubernetes__disable_tcp_keepalive\": false, \"xcom_sidecar_container_image\": \"alpine:3.16.2\"}"
}
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
dag = DAG(
    dag_id="kubernetes_dag",
    schedule_interval=None,
    start_date=days_ago(1),
)

with dag:
    cmd = "echo toto && sleep 30 && echo finish && exit 1"
    KubernetesPodOperator(
        task_id="task-one",
        namespace="default",
        image_pull_policy="Never",
        kubernetes_conn_id="kubernetes_default",
        name="airflow-test-pod",
        image="alpine:3.16.2",
        cmds=["sh", "-c", cmd],
        deferrable=True,
        poll_interval=100,
        do_xcom_push=True,
    )
@MCMcCallum @raphaelauv -> when you encounter a closed issue (especially one closed months ago) with a similar description, the best course of action is to open a new one and describe your circumstances and case, ideally referring to the old issue as related. This lets everyone focus on your issue, which might or might not be related, even if the error message is similar. And you have a chance to restart the issue, focusing on much fresher circumstances: your Airflow version, your K8s provider version, etc. When you add "another" set of things to an existing closed issue, it's entirely unclear for anyone looking at it how to reproduce it. Is it the same issue? Or a different one? Should I look at the original report or the new one? And so on. Also, by opening the issue you own it as the author, so when maintainers ask questions or mark it as "needs more information" it's clear that it's you who should respond, not the original author, and it's much more likely that you will, because the issue is "fresh". So I heartily recommend doing that.
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
Hello
We are trying to use the deferrable option with the KubernetesPodOperator for the first time but we can't get past the error
Invalid kube-config file. Expected key current-context in kube-config
when using deferrable=True. We run Airflow via Cloud Composer in GCP and first upgraded to composer-2.3.2-airflow-2.5.1, but still had the issue, so we upgraded again to composer-2.4.3-airflow-2.5.3 after seeing some posts about a fix in version 7.0.0; we are still faced with the issue.
I have stripped the DAG operator back to basics:
The operator is imported with from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator.
Error:
The Cloud Composer 2 guide states that we need to set config_file to /home/airflow/composer_kube_config, but the error we get seems to imply that the file is missing an expected key, current-context. I raised this with Google, who suggested raising this ticket here, stating that their product team confirmed that this issue is due to a problem with Airflow.
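A quick way to check that claim, sketched here under the assumption that the file is readable from a worker task and is ordinary YAML (this snippet is not from the thread):

import yaml  # PyYAML, available in the Airflow environment

with open("/home/airflow/composer_kube_config") as f:
    cfg = yaml.safe_load(f)

# Print whether the keys the error complains about are actually present.
print("current-context:", cfg.get("current-context"))
print("contexts:", [c.get("name") for c in cfg.get("contexts", [])])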
What you think should happen instead
The DAG runs with the deferrable=True parameter set, in this case printing hello world to the logs.
How to reproduce
Deploy the sample DAG to Airflow 2.5.3, if possible using Cloud Composer 2.4.3.
Operating System
debian:11-slim
Versions of Apache Airflow Providers
Deployment
Google Cloud Composer
Deployment details
composer-2.4.3-airflow-2.5.3
Anything else
No response
Are you willing to submit PR?
Code of Conduct