Fix Flink job tracking #120

Merged: regadas merged 1 commit into spotify:master from 110_job_tracking on Oct 7, 2021

Conversation

regadas commented Sep 29, 2021

Closes #110

regadas commented Sep 29, 2021

/cc @elanv

elanv commented Sep 29, 2021

Looks good! Thanks for this work.

When I looked more closely, the job ID is not printed in Flink versions <= 1.9 when running in blocking mode. (This could be noted in the documentation.)

elanv commented Sep 29, 2021

@regadas There is no functional difference, but an alternative implementation is possible. It would mean a little more work in the submitter, but the simplicity of the resulting operator routine seems to be an advantage.

The job submitter could consist of two containers: a submitter and a job ID provider. The first container writes the job ID to a shared volume; when the second container reads it, it writes it to its termination log and then terminates. In that case, the difference would be getting the ID from the container termination log instead of the log stream.
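
As a rough illustration of the operator side of this idea (not part of this PR; the container name `job-id-provider` and the helper below are hypothetical), the job ID could then be read from the ID provider's termination message once that container reports a terminated state:

```go
package observer

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// jobIDFromSubmitterPod is a hypothetical helper: it returns the Flink job ID
// recorded in the termination message of the "job-id-provider" container once
// that container has terminated. Kubernetes copies the container's termination
// log into ContainerStateTerminated.Message.
func jobIDFromSubmitterPod(pod *corev1.Pod) (string, bool) {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != "job-id-provider" {
			continue
		}
		if t := cs.State.Terminated; t != nil && t.Message != "" {
			return strings.TrimSpace(t.Message), true
		}
	}
	return "", false // provider container not terminated yet, or no ID recorded
}
```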

regadas commented Sep 30, 2021

I actually like the simplicity of this approach. As I pointed out, it seems that the JobID is present for version 1.9.3 as well, even when running in non-detached mode:

https://github.com/apache/flink/blob/6d23b2c38c7a8fd8855063238c744923e1985a63/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java#L764

Spawning extra containers makes things trickier in environments where resource quotas are tight. Users need a better understanding of the underlying infrastructure to specify appropriate resource quotas.
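
For comparison, the log-stream approach comes down to scanning the submitter's output for the ID. A minimal sketch, assuming the log contains Flink's "Job has been submitted with JobID <id>" line (the helper and regex below are illustrative, not this PR's actual code):

```go
package observer

import "regexp"

// jobIDPattern matches Flink's submission confirmation line; a Flink JobID is
// a 16-byte value rendered as 32 hex characters.
var jobIDPattern = regexp.MustCompile(`Job has been submitted with JobID ([0-9a-f]{32})`)

// extractJobID scans the submitter log text and returns the job ID, if found.
func extractJobID(submitLog string) (string, bool) {
	m := jobIDPattern.FindStringSubmatch(submitLog)
	if m == nil {
		return "", false
	}
	return m[1], true
}
```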

elanv commented Sep 30, 2021

Looking at the resource quotas, the number of pods can be limited by quota, but not the number of containers. And in this case, IMO, it's effective to apply the pattern that separates the containers that write and read files. I also think the tracking logic would be more efficient and simpler, because the operator only needs to check the ID once the ID-provider container in the job submitter is in the terminated state. Getting the log stream of a running container may be affected by the state of that container and the node hosting it. After merging this PR, may I improve it in the way I suggested?

regadas commented Sep 30, 2021

> Looking at the resource quotas, the number of pods can be limited by quota, but not the number of containers. And in this case, IMO, it's effective to apply the pattern that separates the containers that write and read files.

I know that ... it's just that when estimating quotas, users need to account for the resources used by all the extra containers. It's a minor thing, I know, but it can be relevant in some scenarios.

> I also think the tracking logic would be more efficient and simpler, because the operator only needs to check the ID once the ID-provider container in the job submitter is in the terminated state.

Yes, it would be a bit more efficient, but parsing the log should be reasonably fast ... I don't think this is a deal breaker.
Another thing we get from reading the log directly is the ability to set the FailureReasons property when submission fails. Currently, because we rely on the termination log and its size is limited, the message has to be truncated, which doesn't give the full picture.

Idk, we can still give the two-container idea (submitter and job ID provider) a try and see how it looks. IMO, right now it feels like a few too many moving pieces just to get the JobID, but we'll see. You can open a PR, and it becomes easier to discuss.

> Getting the log stream of a running container may be affected by the state of that container and the node hosting it.

What are you foreseeing here?

elanv commented Oct 1, 2021

Thanks for the comments, @regadas. I meant that the state of the connection between the operator and the container or its node could affect the operator's performance, since the operator gets the logs directly from the running container.

elanv commented Oct 1, 2021

Above all, I am concerned that log streaming will affect the performance not only of the operator but also of the Kubernetes API server when the Flink operator is run at large scale, because the logs are streamed from the running container through the k8s API server. However, I'm not sure whether it really affects the k8s cluster, because I haven't actually used log streaming via the k8s API in practice.
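
For context, reading a pod's log with client-go looks roughly like the sketch below (illustrative, not the operator's actual code); every such read is served by the Kubernetes API server, which proxies the kubelet on the node running the container, which is where the scaling concern comes from:

```go
package observer

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// readSubmitterLog streams the submitter pod's log through the Kubernetes API
// server and returns it as a string.
func readSubmitterLog(ctx context.Context, client kubernetes.Interface, namespace, podName string) (string, error) {
	req := client.CoreV1().Pods(namespace).GetLogs(podName, &corev1.PodLogOptions{})
	stream, err := req.Stream(ctx)
	if err != nil {
		return "", err
	}
	defer stream.Close()
	data, err := io.ReadAll(stream)
	return string(data), err
}
```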

hjwalt commented Oct 5, 2021

@regadas are you working heavily on this branch? I am planning to do some code cleanup, especially on the observer and status derivation. It should mostly be backward compatible; I just want to make the conditions easier to understand so I can figure out why some jobs aren't getting properly updated.

```go
for _, job := range flinkJobList.Jobs {
	// Match the submitted job by the ID recorded in the submit log.
	if observed.flinkJobSubmitLog.JobID == job.Id {
		flinkJobStatus.flinkJob = &job
	} else {
		flinkJobStatus.flinkJobsUnexpected = append(flinkJobStatus.flinkJobsUnexpected, job.Id)
	}
}
```
Contributor

@regadas a null pointer exception can happen here when the job pod is lost.

Contributor Author

Nice catch! Indeed, it was missing a check previously.

Contributor Author

Actually, this should be fine because it is not referring to the job pod; also, if the job pod is lost then we don't have a submit log, which is checked earlier.

regadas commented Oct 5, 2021

I've been fiddling with other stuff and I'm still debating whether this should go in or not (while thinking about alternative solutions). I'm leaning towards giving it a try and seeing how things behave; not sure if @elanv is on board with this, though.

Do you want this to go in so you can wrap up your changes?

hjwalt commented Oct 5, 2021

Maybe it's possible to merge into your branch instead? Or work on a common branch in the spotify repo. Basically, I want to do three things:

  1. Reduce the amount of noise in the log
  2. Figure out why sometimes the reconcile cycle is much faster than the expected delay
  3. Create a v2 observer and status update based on the current code with much simpler conditions

I did test this branch, and I don't think it fully solves the anomalies around job submissions; there are many more odd status conditions that determine the job status and conditions.

elanv commented Oct 5, 2021

@regadas I still think it's better to avoid using log streams. If you like my suggestion, I can work on it in my existing PR.

elanv commented Oct 6, 2021

@regadas It seems there is no perfect solution to this issue of tracking a job after submission. If so, I think it would be fine to proceed with this PR for now. Later, as I suggested in another issue, I would like to discuss introducing new CRDs that could solve this issue as well. And once this PR is merged, I'll wrap up my PR on top of the new head.

@regadas regadas merged commit 1f8e783 into spotify:master Oct 7, 2021
@regadas regadas deleted the 110_job_tracking branch October 7, 2021 10:02

Successfully merging this pull request may close these issues.

Job tracking routine changes for the blocking mode