Replies: 24 comments 10 replies
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.
-
We also found that sometimes tasks fail without any apparent reason and without retries, but the same task succeeds in the next run.
-
Your description is a bit vague, but I can say that the scheduler does not create empty slots in advance with the KubernetesExecutor. For each task being executed, one pod is launched by the scheduler. Are any tasks starting at all, or are all tasks failing? If some are running and some are not, is it sporadic, or do certain tasks always fail while others succeed? Is there any observable pattern?
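If it helps with diagnosing, here is a small sketch using the official `kubernetes` Python client to check whether any worker pods are being created at all. The namespace name and the `kubernetes_executor=True` label selector are assumptions based on a typical KubernetesExecutor deployment; adjust them to match your setup.

```python
# A diagnostic sketch, assuming the official `kubernetes` Python client is
# installed and the worker pods carry the usual KubernetesExecutor labels
# (the namespace and label selector below are assumptions - adjust as needed).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="airflow",                        # assumed namespace
    label_selector="kubernetes_executor=True",  # assumed label on worker pods
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```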
-
In my case, after some time the Airflow scheduler stops queuing tasks. We have empty slots in the pool and all components are running, yet no pods are spawned. All tasks are marked failed or skipped without any attempts. After restarting the scheduler pods, everything works fine again for a while. We only see this on an environment with 200+ DAGs.
-
Maybe it is the same issue: tasks run well for some time after a scheduler restart, but then a random task can fail without any logs or attempts. This started happening after switching from the CeleryExecutor to the KubernetesExecutor.
-
This sounds more like troubleshooting.
-
Why is this a discussion?
-
I described my issue: everything worked fine with the Airflow Celery workers, but not with the KubernetesExecutor. I also wrote that nothing appears in the scheduler logs.
-
Why do you think it changes anything from your perspective whether it is an issue or a discussion, @crabio? What's your expectation here? For me, a discussion means that if more details arrive and it is clearly a bug in Airflow that someone can reproduce, we might classify it as an issue in Airflow; if it is unclear, it stays a discussion. In both cases, it's really up to the author to provide enough evidence to allow people (who help others here in their free time) to help the author with their troubleshooting. In the case of an actual bug in Airflow, they might even be able to fix it for everyone. What is your actual expectation (for software that you get absolutely for free, without any guarantees)? Why do you think it matters to have it as an issue? What do you think will happen differently?
-
I'll try increasing the log level to DEBUG on all Airflow components and try to catch more details about the issue.
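For reference, a minimal sketch of the config knob involved, assuming the standard `AIRFLOW__<SECTION>__<KEY>` environment-variable override; in a real deployment this would be set as a container environment variable on each component (scheduler, webserver, workers) rather than from Python code.

```python
# A minimal sketch, assuming the standard AIRFLOW__<SECTION>__<KEY> environment
# variable override for airflow.cfg; in practice this is set on the containers
# before the Airflow processes start, not at runtime like this.
import os

os.environ["AIRFLOW__LOGGING__LOGGING_LEVEL"] = "DEBUG"
```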
-
I also found an error when scheduling DAGs with long names. One DAG has a long task name and, as a result, the Kubernetes API raises an error because the pod name is limited to 63 characters. Do you know of any fixes for this? It is inconvenient that I can't use normal DAG and task names longer than 63 characters with the KubernetesExecutor.
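This is not the provider's actual implementation, just a hypothetical sketch of the usual way around such limits: truncate the generated name and append a short hash so it stays unique and within 63 characters.

```python
# A hypothetical sketch (not the cncf.kubernetes provider's real code) of how a
# long DAG/task id could be folded into a Kubernetes-safe name: truncate and
# append a short hash so the result stays unique and within the 63-char limit.
import hashlib

MAX_LEN = 63  # the Kubernetes name limit mentioned above


def safe_pod_name(dag_id: str, task_id: str) -> str:
    base = f"{dag_id}-{task_id}".lower().replace("_", "-")
    if len(base) <= MAX_LEN:
        return base
    suffix = hashlib.sha256(base.encode()).hexdigest()[:8]
    return f"{base[:MAX_LEN - len(suffix) - 1]}-{suffix}"


print(safe_pod_name("some-very-long-dag-name-" * 3, "an-equally-long-task-name"))
```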
-
Mhm, I am not a K8s expert, but going by what I could find for that error: if this hypothesis is correct, then (1) you might need to check why the pods are restarted. Are they running out of resources and need larger reservations (e.g. is RAM on the K8s node running out?), or is there movement in the infrastructure (a node being drained while the workload is running)?
-
It seems like my issue is related to #13542.
-
I also have 2 schedulers, and each of them thinks it has different tasks currently running, but I definitely have no running tasks.
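One way to see what the schedulers actually believe is to look at the metadata database directly; here is a small sketch, assuming an environment where Airflow is installed and configured (the `queued_by_job_id` column shows which scheduler job queued each task instance).

```python
# A diagnostic sketch, assuming Airflow is installed and configured so the
# metadata DB is reachable; it lists the task instances currently recorded as
# queued/running and which scheduler job queued each of them.
from airflow import settings
from airflow.models import TaskInstance
from airflow.utils.state import State

session = settings.Session()
stuck = (
    session.query(TaskInstance)
    .filter(TaskInstance.state.in_([State.QUEUED, State.RUNNING]))
    .all()
)
for ti in stuck:
    print(ti.dag_id, ti.task_id, ti.run_id, ti.state, ti.queued_by_job_id)
session.close()
```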
-
I think I have found a race condition in the Airflow scheduler:
-
I found a working workaround! It seems like a race between multiple schedulers with the KubernetesExecutor.
-
I found an interesting insight: over 3 hours I have 830
-
I found the sequence of events for a good task and a stuck task in the scheduler. Good task execution debug:
Stuck task execution debug (applications_info):
-
@jscheffl
Solution:
What do you think?
-
I found a race condition between the 2 schedulers.
As a workaround, I'll use a single scheduler replica with the KubernetesExecutor.
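To double-check how many schedulers are actually competing, the scheduler heartbeats in the metadata DB can be inspected. A sketch, assuming a recent Airflow 2.x release where the job ORM model lives in `airflow.jobs.job` (older versions expose `BaseJob` instead):

```python
# A diagnostic sketch, assuming a recent Airflow 2.x release where the job ORM
# model is airflow.jobs.job.Job (older versions use airflow.jobs.base_job.BaseJob).
# It lists running scheduler jobs and their last heartbeat, which shows how many
# scheduler replicas are competing for the same work.
from airflow import settings
from airflow.jobs.job import Job

session = settings.Session()
schedulers = (
    session.query(Job)
    .filter(Job.job_type == "SchedulerJob", Job.state == "running")
    .all()
)
for job in schedulers:
    print(job.id, job.hostname, job.latest_heartbeat)
session.close()
```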
-
@crabio
-
Fixed in #36478
-
Apache Airflow version
2.7.2
What happened
After some analysis, it is confusing that the kubernetes provider v7.8.0 already has code to handle this:
But it is not clear why we still get the error.
I have also seen this error reported many times for other versions of Airflow.
What you think should happen instead
No response
How to reproduce
Operating System
Docker based on apache/airflow:2.7.2
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
apache-airflow-providers-cncf-kubernetes==7.8.0
Anything else
No response
Are you willing to submit PR?
Code of Conduct