Fix Flink job tracking #120
Conversation
/cc @elanv
Looks good! Thanks for this work. When I looked more closely, the job ID was not printed in versions <= 1.9 with blocking mode. (This could be noted in the documentation.)
@regadas There is no functional difference, but alternative implementations are possible. There would be a little more work to do in the submitter, but the simplicity of the operator routine seems to be an advantage. The job submitter could consist of two containers: a submitter and a job ID provider.
I actually like the simplicity of this approach. As I pointed out, spawning extra containers makes things trickier in environments where resource quotas are tight: users need more understanding of the underlying infra to be able to specify appropriate resource quotas.
Looking at resource quotas, the number of pods can be limited by quota, but not the number of containers. And in this case, IMO, it's effective to apply the pattern that separates the container that writes a file from the one that reads it. I also think the tracking logic becomes simpler and more efficient, because the operator only needs to check the ID once the ID provider container in the job submitter is in the terminated state. Getting the log stream of a running container may be affected by the state of that container and the node hosting it. After merging this PR, can I improve it in the way I suggested?
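For illustration, a minimal Go sketch of the operator-side check elanv proposes — not the current implementation. It assumes a hypothetical ID-provider container named `job-id-provider` that writes the job ID to its termination message before exiting:

```go
package main

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// jobIDFromPod sketches the proposed check: once the (hypothetical)
// "job-id-provider" container has terminated successfully, the operator
// reads the job ID it left behind — here via the container's termination
// message — instead of streaming logs from a live container.
func jobIDFromPod(pod *corev1.Pod) (string, bool) {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != "job-id-provider" { // assumed container name
			continue
		}
		if t := cs.State.Terminated; t != nil && t.ExitCode == 0 {
			// Assumes the provider container writes the job ID to its
			// terminationMessagePath (default /dev/termination-log).
			id := strings.TrimSpace(t.Message)
			return id, id != ""
		}
	}
	return "", false
}
```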
I know; it's just that users, when estimating quotas, need to account for the resources used by all the extra containers. It's a minor thing, I know, but it can be relevant in some scenarios.
Yes, it will be a bit more efficient, but parsing the log should be somewhat fast ... I don't think this is a deal breaker. That said, we can still give the two-container idea (submitter and job ID provider) a try and see how that looks. IMO right now it feels like a few too many moving pieces just to get the JobID, but we'll see. You can open a PR and it becomes easier to discuss.
What are you foreseeing here?
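For context on the log-parsing approach discussed above, here is a minimal Go sketch of extracting a JobID from submitter output. The exact log-line format (`Job has been submitted with JobID <32-hex-id>`) is an assumption based on typical Flink CLI output, not the operator's actual parser:

```go
package main

import (
	"fmt"
	"regexp"
)

// Assumed Flink CLI output format: "Job has been submitted with JobID <id>",
// where the ID is 32 lowercase hex characters.
var jobIDPattern = regexp.MustCompile(`JobID\s+([0-9a-f]{32})`)

// extractJobID scans the captured submitter log for the job ID.
func extractJobID(submitLog string) (string, bool) {
	m := jobIDPattern.FindStringSubmatch(submitLog)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	log := "Job has been submitted with JobID f91d1a1bc5e4a5b2f43cbad5a98f7b1c"
	if id, ok := extractJobID(log); ok {
		fmt.Println("job id:", id)
	}
}
```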
Thanks for the comments @regadas. I meant that the state of the connection between the operator and the container or node could affect the operator's performance, since the operator gets the logs directly from the running container.
Above all, I am concerned that log streaming could affect the performance not only of the operator but also of the Kubernetes API server when the Flink operator is run at scale, because the logs are streamed from the running container through the API server. However, I'm not sure how much it really affects the cluster, because I haven't measured the impact of the log-streaming API in practice.
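For reference, a minimal client-go sketch of the log fetch being discussed; every byte returned here travels through the Kubernetes API server, which is the source of the scaling concern above. The container name is an assumption:

```go
package main

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// fetchSubmitterLog reads the full log of one container through the
// Kubernetes API server. Names ("submitter", etc.) are illustrative.
func fetchSubmitterLog(ctx context.Context, cs kubernetes.Interface, ns, pod string) (string, error) {
	req := cs.CoreV1().Pods(ns).GetLogs(pod, &corev1.PodLogOptions{
		Container: "submitter", // assumed container name
	})
	stream, err := req.Stream(ctx) // log bytes are proxied via the API server
	if err != nil {
		return "", err
	}
	defer stream.Close()
	b, err := io.ReadAll(stream)
	return string(b), err
}
```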
@regadas are you working heavily on this branch? I am planning to do some code cleanup, especially on the observer and status derivation. It should mostly be backward compatible, just making the conditions easier to understand so I can see why some jobs aren't getting properly updated.
```go
// Diff context from the observer: jobs reported by the Flink API are
// matched against the job ID parsed from the submitter log.
flinkJobStatus.flinkJobsUnexpected = append(flinkJobStatus.flinkJobsUnexpected, job.Id)
for _, job := range flinkJobList.Jobs {
	if observed.flinkJobSubmitLog.JobID == job.Id {
		flinkJobStatus.flinkJob = &job
```
@regadas a nil pointer dereference can happen here when the job pod is lost
Nice catch! Indeed, a check was missing previously.
Actually, this should be fine because this is not referring to the job pod; also, if the job pod is lost, then we don't have a submit log, which is checked previously.
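For illustration only, a sketch of the guard shape being discussed here, with hypothetical stand-in types (not the operator's actual definitions):

```go
package main

import "fmt"

// Hypothetical stand-ins for the operator's observed state and job types.
type submitLog struct{ JobID string }
type job struct{ Id string }

// matchJob returns the job whose ID matches the submit log, guarding
// against a missing log (e.g. when the job pod is lost).
func matchJob(log *submitLog, jobs []job) *job {
	if log == nil || log.JobID == "" {
		return nil // no submit log observed yet; nothing to match against
	}
	for i := range jobs {
		if jobs[i].Id == log.JobID {
			return &jobs[i]
		}
	}
	return nil
}

func main() {
	jobs := []job{{Id: "abc"}, {Id: "def"}}
	fmt.Println(matchJob(&submitLog{JobID: "def"}, jobs)) // matches
	fmt.Println(matchJob(nil, jobs))                      // nil — guarded
}
```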
I've been fiddling with other stuff and I'm still debating whether this should go in or not (while thinking of alternative solutions). I'm leaning towards giving it a try and seeing how things behave; not sure if @elanv is on board with this, though. Did you want this to go in so you can wrap up your changes?
Maybe it's possible to merge into your branch instead? Or work on a common branch in the spotify repo. Basically I want to do three things:
I did test this branch, and I don't think it fully solves the anomalies of job submission; there are many more odd status conditions that determine the job status and conditions.
@regadas I still think it's better to avoid using log streams. If you like my suggestion, I'll work on it in my existing PR.
@regadas It seems there is no perfect solution for this issue of tracking a job after submission. If so, I think it would be fine to proceed with this PR for now. Later, as I suggested in another issue, I would like to discuss introducing new CRDs that could solve this issue as well. And when this PR is merged, I'll wrap up my PR based on its head.
Closes #110