
Fix enable_logging=True not working in DockerSwarmOperator #35677

Merged
merged 10 commits into apache:main on Dec 6, 2023

Conversation

stavdav143
Contributor

@stavdav143 stavdav143 commented Nov 16, 2023

The Docker Swarm Operator was impractical: the Docker service it created never finished in the Airflow scheduler, and its logs were not received. With this fix, Docker Swarm services become finite and their logs are tracked properly.

The logging logic works like this:
For the current service we ask the Docker API client for the logs with follow=False, because follow=True hangs forever once the service has finished.
We poll the Docker API at 2-second intervals, streaming the logs to the console and checking whether the service has terminated in order to break the loop.
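
For illustration, here is a minimal sketch of that loop using docker-py's low-level APIClient (service_logs and tasks are real docker-py calls; the helper name and the terminal-state set are assumptions for this sketch, not the PR's exact code):

```python
import time

import docker

# Assumed set of task states treated as "terminated" for this sketch.
TERMINAL_STATES = {"complete", "failed", "shutdown", "rejected", "orphaned", "remove"}


def stream_service_logs(client: docker.APIClient, service_id: str, poll_interval: float = 2.0) -> None:
    """Poll service logs with follow=False until every task of the service terminates."""
    printed = 0
    while True:
        # follow=False returns whatever logs exist so far instead of blocking forever.
        chunks = client.service_logs(service_id, stdout=True, stderr=True, follow=False)
        lines = b"".join(chunks).decode(errors="replace").splitlines()
        for line in lines[printed:]:  # emit only lines we have not printed yet
            print(line)
        printed = len(lines)
        # Stop once every task of the service has reached a terminal state.
        tasks = client.tasks(filters={"service": service_id})
        if tasks and all(t["Status"]["State"] in TERMINAL_STATES for t in tasks):
            break
        time.sleep(poll_interval)
```

Note that each poll re-downloads the full log produced so far; the since-based refinement discussed later in this thread addresses that.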

Closes: #28452



@stavdav143 stavdav143 force-pushed the main branch 4 times, most recently from 4b7e9ff to f5d1bcf on November 16, 2023 19:09
@stavdav143 stavdav143 changed the title Fixes #28452: "TaskInstances do not succeed when using enable_logging… Fix #28452: Docker Swarm services succeed and give logs Nov 16, 2023
@eladkal eladkal changed the title Fix #28452: Docker Swarm services succeed and give logs Fix enable_logging=True not working in DockerSwarmOperator Nov 17, 2023
@eladkal
Contributor

eladkal commented Nov 21, 2023

Can you add a unit test to avoid regression?

@eladkal
Contributor

eladkal commented Dec 1, 2023

Hi @stavdav143, please note that we are waiting for tests (or, if the code can't be tested easily, an explanation why).
We rarely accept code changes without tests.

@stavdav143
Contributor Author

Hi @eladkal,

Thanks for reviewing. Yes, I understand. I've been facing some difficulties while trying to provide a demo GIF. I will look into whether I can add tests; to me it is a bit complicated, as Docker infrastructure is involved and the docker_swarm tests here look more mock-based. But I will look into it a bit deeper and see what I can provide.

@stavdav143
Contributor Author

Hi @eladkal
I committed a unit test for logging.
There were small changes I needed to make in the test file in order to comply with the logging algorithm we use.
These were committed separately for better explanation:

  1. follow=True for logs will always result in tasks never ending. This is standard behavior provided upstream by the Docker API.
    Therefore, in DockerSwarmOperator follow is always False.
    041e4c

  2. service_logs is called multiple times, as we continuously poll the Docker API for logs. As indicated in the previous commit, the Docker client hangs when we try to get the logs with follow=True, so we make multiple calls to the API (every 2 seconds) to fetch the new logs.
    c8063a

  3. service_logs is called multiple times. In this test the expected call count increases (6 instead of 5) because we also check whether the service has terminated (+1). As this assertion makes less sense when we make multiple (polling) calls to the Docker API, we might consider removing it or replacing it with something more suitable.
    af2428


And finally, 048ba1 is the commit adding the test case that validates logging in the Docker Swarm Operator: we log two different messages and assert that the two lines come back in the logs in the expected order.
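
For context, here is a rough illustration of the kind of assertions involved, written with unittest.mock against the polling helper sketched earlier in this thread (the mocked return shapes are assumptions, not the PR's actual test code):

```python
from unittest import mock


def test_service_logs_polled_with_follow_false():
    """Illustrative only: logs are fetched repeatedly, always with follow=False."""
    client = mock.MagicMock()
    # The service is still running at the first status check and finished at the second.
    client.tasks.side_effect = [
        [{"Status": {"State": "running"}}],
        [{"Status": {"State": "complete"}}],
    ]
    client.service_logs.return_value = [b"first message\nsecond message\n"]

    stream_service_logs(client, "some-service-id", poll_interval=0)  # helper from the earlier sketch

    assert client.service_logs.call_count == 2
    assert all(call.kwargs.get("follow") is False for call in client.service_logs.call_args_list)
```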

@stavdav143
Contributor Author

Unfortunately @eladkal, I had to revert my test case, as GitHub Actions does not run it on a node in swarm mode, unlike the local test environment with Breeze. This test case highlights what this PR fixes.

235e565

Here is the error from the GitHub Action, which otherwise runs locally:
[screenshot of the GitHub Actions error]

@stavdav143
Contributor Author

The following explains how this PR fixes the Docker Swarm Operator issues:

Bug
Before this fix, the task never ends and the logs are not streamed:
[GIF: output_bug]

With fix
With this fix, the task ends and the logs are streamed synchronously:
[GIF: output]

@potiuk
Member

potiuk commented Dec 4, 2023

Hmm. The way it is implemented, it will generate more and more API traffic the longer the task is running; basically, every two seconds we attempt to download all the logs. I don't think this is a good approach.

I do not know the CLI API well, but that does not sound like a good solution.

@stavdav143
Contributor Author

stavdav143 commented Dec 4, 2023

Well, this issue has been open for quite some time now.

The alternative would be to use follow=True, which by definition blocks and hangs the thread. You'd need a parallel thread to check whether the service has finished, since the call hangs regardless of the service state (Running, Started, Waiting, Done). In the end you don't avoid polling, and you have to introduce additional parallelism, which in Python isn't great, at least to my knowledge.

@potiuk
Member

potiuk commented Dec 4, 2023

Well, this issue has been open for quite some time now.

The alternative would be to use follow=True, which by definition blocks and hangs the thread. You'd need a parallel thread to check whether the service has finished, since the call hangs regardless of the service state (Running, Started, Waiting, Done). In the end you don't avoid polling, and you have to introduce additional parallelism, which in Python isn't great, at least to my knowledge.

We are already doing it for other operators (K8s and others); you do not have to poll the API from Airflow. In many cases, when remote logging is involved, loggers just log to a remote logging service (CloudWatch, for example), which takes care of streaming the logs to the UI. So yes, you can absolutely avoid polling.

You could likely use this:

since (datetime, int, or float) – Show logs since a given datetime, integer epoch (in seconds) or float (in nanoseconds)

And ask the logs to include timestamps and use them.

Especially if you use float (nanoseconds), you could record the nanosecond of the last log line, maybe store the last few lines, add a few nanoseconds of overlap, and de-duplicate the overlapping lines.

This is very similar to your proposal but avoids the potentially huge, ever-increasing traffic and the memory needed to keep the logs around. The way you implemented it might waste MBs or even GBs of memory keeping the whole log in memory.
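
A rough sketch of that refinement, again against docker-py's APIClient.service_logs (which accepts since and timestamps, per the quoted docs above); the epoch bookkeeping and the de-duplication window below are illustrative assumptions, not the PR's exact code:

```python
from __future__ import annotations

from datetime import datetime, timezone

import docker


def poll_new_logs(
    client: docker.APIClient, service_id: str, since: int = 0, seen_tail: list[str] | None = None
) -> tuple[int, list[str]]:
    """Fetch only logs newer than `since` (epoch seconds) and de-duplicate the overlap window."""
    seen_tail = seen_tail or []
    chunks = client.service_logs(
        service_id, stdout=True, stderr=True, follow=False, since=since, timestamps=True
    )
    lines = b"".join(chunks).decode(errors="replace").splitlines()
    for line in lines:
        if line not in seen_tail:  # skip lines already emitted in the previous poll
            print(line)
    if lines:
        # With timestamps=True each line starts with an RFC3339 timestamp, e.g.
        # "2023-12-06T12:34:56.123456789Z some log text".
        last_ts = lines[-1].split(" ", 1)[0][:19]  # keep only whole seconds for parsing
        # Re-request from the second of the last line (a deliberate small overlap);
        # seen_tail removes the duplicated lines on the next call.
        since = int(datetime.strptime(last_ts, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc).timestamp())
        seen_tail = lines[-10:]
    return since, seen_tail
```

Each call then transfers roughly only the lines produced since the previous poll, so traffic and memory stay bounded no matter how long the service runs.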

Member

@potiuk potiuk left a comment

Need to handle memory and traffic better.

@potiuk
Member

potiuk commented Dec 4, 2023

Well, this issue has been open for quite some time now.

So yes, let's get it well implemented and merge it. But we cannot merge a change that we know has performance issues. The problem is that people WILL use it for long-running operations, and it WILL crash their workers, and they WILL come to us when it happens, so we'd better fix it now.

@stavdav143
Contributor Author

OK, valid feedback 👍. I'll see if I can slim it down by using the "since" parameter and timestamps when asking for logs.

@stavdav143
Contributor Author

There you go @potiuk
9434117

Member

@potiuk potiuk left a comment

Nice! That is much better!

@stavdav143
Contributor Author

Thank you!

@potiuk
Member

potiuk commented Dec 5, 2023

Though it needs static check fixes. I recommend pre-commit install; it would fix those while committing.

…ogging=True option in DockerSwarmOperator"

It introduces logging of Docker Swarm services which was previously not
working.
…standard and provided upstream by the Docker API.

Therefore in DockerSwarmOperator follow is always false.
… API for logs. As we indicated in the previous commit,

the docker client malfunctions when we try to get the logs with follow=True. Therefore we make multiple calls to the API (every 2 seconds), to fetch the new logs.
…6 instead of 5) as we check if the service has terminated (+1). As this assertion makes less sense in a situation where we do multiple calls to the Docker API (polling), we might think of removing it or replacing it with something more suitable.
…in the Docker Swarm Operator. We log two times a different message and we assert that the two lines are given back in the logs in the expected sequence.
docker.errors.APIError: 503 Server Error for http+docker://localhost/v1.43/services/create: Service Unavailable ("This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.")

Revert "Final commit of this PR marking the test case that validates logging in the Docker Swarm Operator. We log two times a different message and we assert that the two lines are given back in the logs in the expected sequence."

This reverts commit 048ba1e.
@potiuk potiuk merged commit 8821088 into apache:main Dec 6, 2023

boring-cyborg bot commented Dec 6, 2023

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@potiuk
Member

potiuk commented Dec 6, 2023

Nice :)

@stavdav143
Contributor Author

Hey, thanks a lot !

@stavdav143
Contributor Author

Dear @eladkal, @potiuk

After your email asking for testing, I found a small bug.
I opened this PR here: #36127

eladkal pushed a commit that referenced this pull request Dec 9, 2023
* Fixes #28452: "TaskInstances do not succeed when using enable_logging=True option in DockerSwarmOperator"
It introduces logging of Docker Swarm services which was previously not
working.

* tty=True/False to be chosen by the user as it was the case before this fix (#28452)

* Follow=true for logs will always result in tasks not ending. This is  standard and provided upstream by the Docker API.
Therefore in DockerSwarmOperator follow is always false.

* service_logs called multiple times as we continuously poll the Docker API for logs. As we indicated in the previous commit,
the docker client malfunctions when we try to get the logs with follow=True. Therefore we make multiple calls to the API (every 2 seconds), to fetch the new logs.

* service_logs called multiple times. In this test the tasks increase (6 instead of 5) as we check if the service has terminated (+1). As this assertion makes less sense in a situation where we do multiple calls to the Docker API (polling), we might think of removing it or replacing it with something more suitable.

* Final commit of this PR marking the test case that validates logging in the Docker Swarm Operator. We log two times a different message and we assert that the two lines are given back in the logs in the expected sequence.

* Formatting ruff

* Reverting as Github actions don't run this test as a swarm node:

docker.errors.APIError: 503 Server Error for http+docker://localhost/v1.43/services/create: Service Unavailable ("This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.")

Revert "Final commit of this PR marking the test case that validates logging in the Docker Swarm Operator. We log two times a different message and we assert that the two lines are given back in the logs in the expected sequence."

This reverts commit 048ba1e.

* Logging "since" timestamp to avoid memory issues
Fix for #28452

* Formatting - Fix for #28452

* Fix bugs (#28452): Correctly assign last_line_logged, last_timestamp

Successfully merging this pull request may close these issues.

TaskInstances do not succeed when using enable_logging=True option in DockerSwarmOperator