Fix Flink job tracking #120

Merged: regadas merged 1 commit into spotify:master from 110_job_tracking on Oct 7, 2021

Conversation

regadas commented Sep 29, 2021

Closes #110

regadas commented Sep 29, 2021

/cc @elanv

elanv commented Sep 29, 2021

Looks good! Thanks for this work.

When I looked more closely, the job ID is not printed in Flink versions <= 1.9 when running in blocking mode. (This could be noted in the documentation.)

elanv commented Sep 29, 2021

@regadas There is no functional difference, but an alternative implementation is possible. It would mean a little more work in the submitter, but the simplicity of the resulting operator routine seems to be an advantage.

The job submitter could consist of two containers: a submitter and a job ID provider. The first container writes the job ID to a shared volume; when the second container reads it, it writes it to its termination log and then terminates. In that case, the difference would be getting the ID from the container termination log instead of the log stream.
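
As a rough illustration of the operator side of this idea (not part of this PR; the container name `job-id-provider` and the helper below are hypothetical), the job ID could then be read from the ID provider's termination message once that container reports a terminated state:

```go
package observer

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// jobIDFromSubmitterPod is a hypothetical helper: it returns the Flink job ID
// recorded in the termination message of the "job-id-provider" container once
// that container has terminated. Kubernetes copies the container's termination
// log into ContainerStateTerminated.Message.
func jobIDFromSubmitterPod(pod *corev1.Pod) (string, bool) {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != "job-id-provider" {
			continue
		}
		if t := cs.State.Terminated; t != nil && t.Message != "" {
			return strings.TrimSpace(t.Message), true
		}
	}
	return "", false // provider container not terminated yet, or no ID recorded
}
```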

regadas commented Sep 30, 2021

I actually like the simplicity of this approach. As I pointed out, it seems that the JobID is present for version 1.9.3 as well, even when running in non-detached mode:

https://github.com/apache/flink/blob/6d23b2c38c7a8fd8855063238c744923e1985a63/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java#L764

Spawning extra containers makes things trickier in environments where resource quotas are tight. Users need a better understanding of the underlying infrastructure to specify appropriate resource quotas.
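
For comparison, the log-stream approach comes down to scanning the submitter's output for the ID. A minimal sketch, assuming the log contains Flink's "Job has been submitted with JobID <id>" line (the helper and regex below are illustrative, not this PR's actual code):

```go
package observer

import "regexp"

// jobIDPattern matches Flink's submission confirmation line; a Flink JobID is
// a 16-byte value rendered as 32 hex characters.
var jobIDPattern = regexp.MustCompile(`Job has been submitted with JobID ([0-9a-f]{32})`)

// extractJobID scans the submitter log text and returns the job ID, if found.
func extractJobID(submitLog string) (string, bool) {
	m := jobIDPattern.FindStringSubmatch(submitLog)
	if m == nil {
		return "", false
	}
	return m[1], true
}
```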

elanv commented Sep 30, 2021

Looking at the resource quotas, the number of pods can be limited by quota, but not the number of containers. And in this case, IMO, it's effective to apply the pattern that separates the containers that write and read files. I also think the tracking logic would be more efficient and simpler, because the operator only needs to check the ID once the ID-provider container in the job submitter is in the terminated state. Getting the log stream of a running container may be affected by the state of that container and the node hosting it. After merging this PR, may I improve it in the way I suggested?

regadas commented Sep 30, 2021

> Looking at the resource quotas, the number of pods can be limited by quota, but not the number of containers. And in this case, IMO, it's effective to apply the pattern that separates the containers that write and read files.

I know that ... it's just that when estimating quotas, users need to account for the resources used by all the extra containers. It's a minor thing, I know, but it can be relevant in some scenarios.

> I also think the tracking logic would be more efficient and simpler, because the operator only needs to check the ID once the ID-provider container in the job submitter is in the terminated state.

Yes, it would be a bit more efficient, but parsing the log should be reasonably fast ... I don't think this is a deal breaker.
Another thing we get from reading the log directly is the ability to set the FailureReasons property when submission fails. Currently, because we rely on the termination log and its size is limited, the message has to be truncated, which doesn't give the full picture.

Idk, we can still give the two-container idea (submitter and job ID provider) a try and see how it looks. IMO, right now it feels like a few too many moving pieces just to get the JobID, but we'll see. You can open a PR, and it becomes easier to discuss.

> Getting the log stream of a running container may be affected by the state of that container and the node hosting it.

What are you foreseeing here?

elanv commented Oct 1, 2021

Thanks for the comments, @regadas. I meant that the state of the connection between the operator and the container or its node could affect the operator's performance, since the operator gets the logs directly from the running container.

elanv commented Oct 1, 2021

Above all, I am concerned that log streaming will affect the performance not only of the operator but also of the Kubernetes API server when the Flink operator is run at large scale, because the logs are streamed from the running container through the k8s API server. However, I'm not sure whether it really affects the k8s cluster, because I haven't actually used log streaming via the k8s API in practice.
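
For context, reading a pod's log with client-go looks roughly like the sketch below (illustrative, not the operator's actual code); every such read is served by the Kubernetes API server, which proxies the kubelet on the node running the container, which is where the scaling concern comes from:

```go
package observer

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// readSubmitterLog streams the submitter pod's log through the Kubernetes API
// server and returns it as a string.
func readSubmitterLog(ctx context.Context, client kubernetes.Interface, namespace, podName string) (string, error) {
	req := client.CoreV1().Pods(namespace).GetLogs(podName, &corev1.PodLogOptions{})
	stream, err := req.Stream(ctx)
	if err != nil {
		return "", err
	}
	defer stream.Close()
	data, err := io.ReadAll(stream)
	return string(data), err
}
```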

hjwalt commented Oct 5, 2021

@regadas are you working heavily on this branch? I am planning to do some code cleanup, especially on the observer and status derivation. It should mostly be backward compatible; I just want to make the conditions easier to understand so I can figure out why some jobs aren't getting properly updated.

```go
for _, job := range flinkJobList.Jobs {
	// Match the submitted job by the ID recorded in the submit log.
	if observed.flinkJobSubmitLog.JobID == job.Id {
		flinkJobStatus.flinkJob = &job
	} else {
		flinkJobStatus.flinkJobsUnexpected = append(flinkJobStatus.flinkJobsUnexpected, job.Id)
	}
}
```
Contributor

@regadas a null pointer exception can happen here when the job pod is lost.

Contributor Author

Nice catch! Indeed, it was missing a check previously.

Contributor Author

Actually, this should be fine because it is not referring to the job pod; also, if the job pod is lost then we don't have a submit log, which is checked earlier.

regadas commented Oct 5, 2021

I've been fiddling with other stuff and I'm still debating whether this should go in or not (while thinking about alternative solutions). I'm leaning towards giving it a try and seeing how things behave; not sure if @elanv is on board with this, though.

Do you want this to go in so you can wrap up your changes?

hjwalt commented Oct 5, 2021

Maybe it's possible to merge into your branch instead? Or work on a common branch in the spotify repo. Basically, I want to do three things:

  1. Reduce the amount of noise in the log
  2. Figure out why sometimes the reconcile cycle is much faster than the expected delay
  3. Create a v2 observer and status update based on the current code with much simpler conditions

I did test this branch, and I don't think it fully solves the anomalies around job submissions; there are many more odd status conditions that determine the job status and conditions.

elanv commented Oct 5, 2021

@regadas I still think it's better to avoid using log streams. If you like my suggestion, I can work on it in my existing PR.

elanv commented Oct 6, 2021

@regadas It seems there is no perfect solution to this issue of tracking a job after submission. If so, I think it would be fine to proceed with this PR for now. Later, as I suggested in another issue, I would like to discuss introducing new CRDs that could solve this issue as well. And once this PR is merged, I'll wrap up my PR on top of the new head.

@regadas regadas merged commit 1f8e783 into spotify:master Oct 7, 2021
@regadas regadas deleted the 110_job_tracking branch October 7, 2021 10:02

Successfully merging this pull request may close these issues.

Job tracking routine changes for the blocking mode